# Baltimore City population change over the last 23 years - EDA

## Introduction

This notebook analyzes population changes in Baltimore from 2000 to 2023. The goal is to explore and visualize the population decline over the years and check the dataset for any issues like missing values, duplicates, and outliers that may affect the analysis.

In [102]:
import pandas as pd
import numpy as np

import plotly.express as px

In [103]:
# import dataset 
bmore_pop_data = pd.read_csv('../baltimore_population_2000_2023.csv')

# Analyzing the data 

In [104]:
# check the info in the dataset
bmore_pop_data

| | Year | Population | Year on Year Change | Change in Percent |
|---|---|---|---|---|
| 0 | 2000 | 648746 | - | - |
| 1 | 2001 | 638700 | -10046 | -1.55% |
| 2 | 2002 | 630367 | -8333 | -1.30% |
| 3 | 2003 | 623567 | -6800 | -1.08% |
| 4 | 2004 | 614564 | -9003 | -1.44% |
| 5 | 2005 | 610068 | -4496 | -0.73% |
| 6 | 2006 | 607864 | -2204 | -0.36% |
| 7 | 2007 | 606006 | -1858 | -0.31% |
| 8 | 2008 | 603758 | -2248 | -0.37% |
| 9 | 2009 | 601984 | -1774 | -0.29% |


In [105]:
# Check data types of columns
bmore_pop_data.dtypes

Year                    int64
Population             object
Year on Year Change    object
Change in Percent      object
dtype: object

In [106]:
# Rename columns for referencing 
bmore_pop_data = bmore_pop_data.rename(columns={"Year": "year", 
                                                "Population": "population", 
                                                "Year on Year Change": "year_on_year_change", 
                                                "Change in Percent": "change_in_percent"})

In [118]:
bmore_pop_data.describe()

| | year | population | year_on_year_change |
|---|---|---|---|
| count | 24.0 | 24.0 | 24.0 |
| mean | 2011.5 | 609944.458333 | -3479.458333 |
| std | 7.071068 | 20665.320726 | 6122.542538 |
| min | 2000.0 | 565239.0 | -11444.0 |
| 25% | 2005.75 | 602926.75 | -7506.25 |
| 50% | 2011.5 | 612708.5 | -4182.0 |
| 75% | 2017.25 | 622882.0 | -863.75 |
| max | 2023.0 | 648746.0 | 18958.0 |


## Check for duplicates and missing values 

In [107]:
# Check for duplicates
int(bmore_pop_data.duplicated().sum())

0

In [108]:
# Check for missing_values 
bmore_pop_data.isnull().sum()

year                   0
population             0
year_on_year_change    0
change_in_percent      0
dtype: int64

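The dataset appears to be intact, with no duplicate rows or null values. However, the first row contains "-" placeholders in the two change columns, which `isnull()` does not count as missing because they are strings. I will treat them as NaN values and convert them to 0.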
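The placeholder can also be handled when the file is loaded. The following is a minimal sketch, assuming the CSV stores numbers with comma thousands separators; the inline sample is hypothetical and only mirrors the first rows shown above. It uses the `na_values` and `thousands` parameters of `pd.read_csv` so that "-" is counted as missing and the separators are parsed on the way in:

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
sample_csv = io.StringIO(
    'Year,Population,Year on Year Change,Change in Percent\n'
    '2000,"648,746",-,-\n'
    '2001,"638,700","-10,046",-1.55%\n'
    '2002,"630,367","-8,333",-1.30%\n'
)

# Treat "-" as missing and strip thousands separators while parsing
sample = pd.read_csv(sample_csv, na_values=['-'], thousands=',')

# The placeholder now shows up in the missing-value count
missing_counts = sample.isnull().sum()
print(missing_counts)
```

With this approach the missing-value check reports one gap in each change column instead of zero, which makes the later fill step an explicit decision rather than a string replacement.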

## Changes to Population column

* Remove non-numeric characters and convert the column to a numeric type for calculations

In [110]:
# Remove commas and convert values to int
bmore_pop_data['population'] = bmore_pop_data['population'].str.replace(',', '')
bmore_pop_data['population'] = bmore_pop_data['population'].astype('int')
bmore_pop_data['population'].head()

0    648746
1    638700
2    630367
3    623567
4    614564
Name: population, dtype: int64

## Changes to Year change column

* Replace non-numeric values and convert the column to a numeric type for calculations

In [111]:
# Remove commas, replace "-" with NaN, fill NaN with 0 and convert to int
bmore_pop_data['year_on_year_change'] = bmore_pop_data['year_on_year_change'].str.replace(',', '')
bmore_pop_data['year_on_year_change'] = bmore_pop_data['year_on_year_change'].replace('-', np.nan)
bmore_pop_data['year_on_year_change'] = pd.to_numeric(bmore_pop_data['year_on_year_change'], 
                                                      errors='coerce').fillna(0).astype(int)
bmore_pop_data['year_on_year_change'].head()

0        0
1   -10046
2    -8333
3    -6800
4    -9003
Name: year_on_year_change, dtype: int64
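Filling the first row with 0 treats 2000 as a year with no change, which pulls any average of this column toward zero. The sketch below uses the first five values from the table above to show the difference between skipping the missing value and filling it; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# First five year-on-year changes; the 2000 row has no prior year to compare with
changes = pd.Series([np.nan, -10046, -8333, -6800, -9003], name='year_on_year_change')

# mean() skips NaN by default, so this averages the four real changes
mean_skipping_missing = changes.mean()

# After filling, the 0 counts as a fifth observation
mean_with_zero_fill = changes.fillna(0).mean()

print(mean_skipping_missing)  # -8545.5
print(mean_with_zero_fill)    # -6836.4
```

Keeping NaN in the column (for example with the nullable `Int64` dtype) is one way to avoid this, at the cost of handling missing values in later steps.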

## Changes to Percent change column

* Replace non-numeric values and convert the column to a numeric type for calculations
* Convert percentages to floats and then to decimal form (the cell below only handles the "-" placeholder, so the column remains `object`)

In [112]:
# Replace the "-" placeholder in the first row with 0
bmore_pop_data['change_in_percent'] = bmore_pop_data['change_in_percent'].replace('-', np.nan)
bmore_pop_data['change_in_percent'] = bmore_pop_data['change_in_percent'].fillna(0)
bmore_pop_data['change_in_percent'].head()

0         0
1    -1.55%
2    -1.30%
3    -1.08%
4    -1.44%
Name: change_in_percent, dtype: object
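The output above is still of dtype `object`, so the percent strings cannot be used in calculations yet. One possible way to finish the conversion is sketched below on a hypothetical three-value sample that mirrors the column after the fill step (an integer 0 followed by percent strings):

```python
import pandas as pd

# Hypothetical sample: mixed int and str values, as left by fillna(0)
pct_raw = pd.Series([0, '-1.55%', '-1.30%'], dtype=object, name='change_in_percent')

# Cast everything to str first so .str works on the integer 0 as well,
# strip the trailing percent sign, parse to float, then scale to decimal form
pct_decimal = (
    pd.to_numeric(pct_raw.astype(str).str.rstrip('%'), errors='coerce') / 100
)

print(pct_decimal.dtype)
print(pct_decimal)
```

The cast to `str` matters here: calling `.str.rstrip` directly on the mixed column would turn the integer 0 into NaN.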

In [129]:
# Calculate summary statistics: average yearly change, total loss, mean and standard
# deviation of population, and the largest single-year gain and loss
def calculate_statistics(df):
    avg_pop_loss = int(df['year_on_year_change'].mean())
    # Difference between the highest and lowest population in the period
    total_pop_loss = int(df['population'].max() - df['population'].min())
    # Use the df argument rather than the global DataFrame
    avg_pop = int(df['population'].mean())
    std_pop = int(df['population'].std())
    max_gain = int(df['year_on_year_change'].max())
    max_loss = int(df['year_on_year_change'].min())
    return avg_pop_loss, total_pop_loss, avg_pop, std_pop, max_gain, max_loss

avg_pop_loss, total_pop_loss, avg_pop, std_pop, max_gain, max_loss = calculate_statistics(bmore_pop_data)
year_of_max_gain = bmore_pop_data.loc[bmore_pop_data['year_on_year_change'].idxmax(), 'year']
year_of_max_loss = bmore_pop_data.loc[bmore_pop_data['year_on_year_change'].idxmin(), 'year']

display(f"Average population change per year: {avg_pop_loss}")
display(f"Total population loss from 2000-2023: {total_pop_loss}")
display(f"Average population from 2000-2023: {avg_pop}")
display(f"Standard deviation of population from 2000-2023: {std_pop}")
display(f"The max loss in one year was: {max_loss} in {year_of_max_loss}")
display(f"The max gain in one year was: {max_gain} in {year_of_max_gain}")


'Average population change per year: -3479'

'Total population loss from 2000-2023: 83507'

'Average population from 2000-2023: 609944'

'Standard deviation of population from 2000-2023: 20665'

'The max loss in one year was: -11444 in 2020'

'The max gain in one year was: 18958 in 2010'

The average annual change of about -3,479 people reflects the overall downward trend, and the total loss of 83,507 shows the scale of the decline. The standard deviation of the population (20,665) mostly captures that long-run decline, while the year-on-year changes show the notable fluctuations: a gain of 18,958 in 2010 and the largest single-year loss, 11,444, in 2020. The 2010 gain is a clear outlier. The 2020 loss is less extreme, since 2001 also saw a loss of more than 10,000.
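Whether a year counts as an outlier depends on the rule used. As a rough sketch, the common 1.5 × IQR rule can be applied to the yearly changes; the helper name `flag_iqr_outliers` and the sample values are hypothetical and not part of the original analysis:

```python
import pandas as pd


def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical yearly changes, not the real dataset
sample_changes = pd.Series([-9000, -7000, -5000, -4000, -3000, -1000, 19000])
mask = flag_iqr_outliers(sample_changes)
print(sample_changes[mask])
```

Applied by hand to the quartiles from `describe()` above (25% = -7506.25, 75% = -863.75), the IQR is 6642.5, so the fences sit at -17470.0 and 9100.0. Under this rule the 2010 gain of 18958 is flagged, while the 2020 loss of -11444 stays inside the lower fence. Note that those quartiles include the zero-filled first row.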

# Scatter Plot

In [114]:
# Scatterplot using year vs population to show decline in population since 2000
fig = px.scatter(bmore_pop_data, x='year', y='population', title='Baltimore City population by year',
                 labels={'year': 'Year', 'population': 'Population'})
fig.show()

The chart above visualizes the decline from 2000-2023. 
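To put a number on the decline seen in the scatter plot, a straight line can be fitted to the points. This is only a sketch using the first five rows of the table above, so the slope describes 2000-2004 rather than the full period:

```python
import numpy as np

# First five rows of the dataset as shown earlier
years = np.array([2000, 2001, 2002, 2003, 2004])
population = np.array([648746, 638700, 630367, 623567, 614564])

# Centre the years so the fit is well conditioned; the slope is unaffected
slope, intercept = np.polyfit(years - years.mean(), population, deg=1)

# slope is the fitted change in population per year,
# intercept is the fitted population at the mean year
print(round(slope, 1), round(intercept, 1))
```

A linear fit is a simplification here: a single large jump such as the one in 2010 would pull the line away from the underlying year-to-year pattern.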

# Bar Chart

In [115]:
fig2 = px.bar(bmore_pop_data, x='year', y='year_on_year_change', 
             title='Change in Baltimore City population by year', 
             labels={'year_on_year_change': '# of people per year', 'year': 'Year'})

fig2.show()

This chart offers a different perspective, showing the net gain or loss of population in each year. The spike in 2010 and the dip in 2020 are clearly visible here.
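Since the goal includes checking the dataset for issues, the change columns can be cross-checked against the population column itself. The sketch below recomputes them with `diff()` and `pct_change()` on the first five rows shown earlier; the DataFrame name `check` is illustrative:

```python
import pandas as pd

# First five rows of the cleaned dataset, as shown in the outputs above
check = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2004],
    'population': [648746, 638700, 630367, 623567, 614564],
    'year_on_year_change': [0, -10046, -8333, -6800, -9003],
})

# Absolute change from the previous year; the first row has no predecessor
recomputed_change = check['population'].diff().fillna(0).astype('int64')

# Percent change from the previous year, rounded like the source column
recomputed_pct = (check['population'].pct_change() * 100).round(2)

changes_match = bool((recomputed_change == check['year_on_year_change']).all())
print(changes_match)
print(recomputed_pct.tolist())
```

For these rows the recomputed values agree with the reported ones, including the percentages -1.55, -1.30, -1.08 and -1.44. Running the same check on all 24 rows would confirm whether the 2010 and 2020 values are consistent with the population figures.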

## Conclusion

The overall trend for Baltimore has been a decline in population, which may reflect broader issues such as economic factors, migration, and urban development challenges. The year 2010 stands out as a major outlier, with a gain of 18,958 that breaks the long-term trend, but it was short-lived, as the decline continued after 2015. The year 2020 stands out not just as a continuation of the ongoing decline but as the largest single-year drop in the dataset (11,444), possibly influenced by a combination of local and global events. Further investigation could help determine whether this sharp dip was an anomaly or the beginning of an accelerating decline.