# Baltimore City population change over the last 23 years - EDA

## Introduction

This notebook analyzes population changes in Baltimore from 2000 to 2023. The goal is to explore and visualize the population decline over the years and check the dataset for any issues like missing values, duplicates, and outliers that may affect the analysis.

In [48]:
import pandas as pd
import numpy as np

import plotly.express as px

In [49]:
# import dataset 
bmore_pop_data = pd.read_csv('../baltimore_population_2000_2023.csv')

In [50]:
# check the info in the dataset
bmore_pop_data

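|   | Year | Population | Year on Year Change | Change in Percent |
|---|------|------------|---------------------|-------------------|
| 0 | 2000 | 648746 | - | - |
| 1 | 2001 | 638700 | -10046 | -1.55% |
| 2 | 2002 | 630367 | -8333 | -1.30% |
| 3 | 2003 | 623567 | -6800 | -1.08% |
| 4 | 2004 | 614564 | -9003 | -1.44% |
| 5 | 2005 | 610068 | -4496 | -0.73% |
| 6 | 2006 | 607864 | -2204 | -0.36% |
| 7 | 2007 | 606006 | -1858 | -0.31% |
| 8 | 2008 | 603758 | -2248 | -0.37% |
| 9 | 2009 | 601984 | -1774 | -0.29% |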


In [51]:
# Check data types of columns
bmore_pop_data.dtypes

Year                    int64
Population             object
Year on Year Change    object
Change in Percent      object
dtype: object

In [52]:
# Rename columns for referencing 
bmore_pop_data = bmore_pop_data.rename(columns={"Year": "year", 
                                                "Population": "population", 
                                                "Year on Year Change": "year_on_year_change", 
                                                "Change in Percent": "change_in_percent"})

## Check for duplicates, missing values and outliers

In [53]:
# Check for duplicates
bmore_pop_data.duplicated().sum()

np.int64(0)

In [54]:
# Check for missing values
bmore_pop_data.isnull().sum()

year                   0
population             0
year_on_year_change    0
change_in_percent      0
dtype: int64
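
Note that `isnull()` only counts real NaN values, so string placeholders such as "-" are not reported by the check above. The following is a minimal sketch of how they could be surfaced at read time. It assumes the raw file stores numbers as comma-formatted strings; the inline sample is a hypothetical stand-in built from the first rows shown earlier, not the actual CSV.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw CSV; the comma formatting is an assumption.
raw_csv = io.StringIO(
    'Year,Population,Year on Year Change,Change in Percent\n'
    '2000,"648,746",-,-\n'
    '2001,"638,700","-10,046",-1.55%\n'
    '2002,"630,367","-8,333",-1.30%\n'
)
as_strings = pd.read_csv(raw_csv)

# The placeholder is an ordinary string, so isnull() does not count it.
missed = int(as_strings["Year on Year Change"].isnull().sum())

# Declaring "-" as a missing-value marker at read time exposes it.
raw_csv.seek(0)
with_na = pd.read_csv(raw_csv, na_values=["-"], thousands=",")
found = int(with_na["Year on Year Change"].isnull().sum())

print(missed, found)
```

Only fields that are exactly "-" are treated as missing, so negative numbers are unaffected.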

In [63]:
bmore_pop_data.describe()

|       | year | population | year_on_year_change | change_in_percent |
|-------|------|------------|---------------------|-------------------|
| count | 24.0 | 24.0 | 24.0 | 24.0 |
| mean | 2011.5 | 609944.458333 | -3479.458333 | 0.0 |
| std | 7.071068 | 20665.320726 | 6122.542538 | 0.0 |
| min | 2000.0 | 565239.0 | -11444.0 | 0.0 |
| 25% | 2005.75 | 602926.75 | -7506.25 | 0.0 |
| 50% | 2011.5 | 612708.5 | -4182.0 | 0.0 |
| 75% | 2017.25 | 622882.0 | -863.75 | 0.0 |
| max | 2023.0 | 648746.0 | 18958.0 | 0.0 |


In [64]:
# Average population and standard deviation

The dataset appears to be intact, with no duplicates or null values. The first row does contain "-" placeholders in the change columns, which I will treat as NaN and convert to 0. The outliers in the data can be seen in the 'year_on_year_change' column, where there w

## Changes to Population column

* Remove the thousands separators and convert the column to integers for calculations

In [55]:
# Remove commas and convert values to int
bmore_pop_data['population'] = bmore_pop_data['population'].str.replace(',', '')
bmore_pop_data['population'] = bmore_pop_data['population'].astype('int')
bmore_pop_data['population']

0     648746
1     638700
2     630367
3     623567
4     614564
5     610068
6     607864
7     606006
8     603758
9     601984
10    620942
11    620493
12    623035
13    622591
14    623833
15    622831
16    616542
17    610853
18    603241
19    594601
20    583157
21    576578
22    569107
23    565239
Name: population, dtype: int64

## Changes to Year change column

* Replace non-numerical values and convert the column to numeric values for calculations

In [56]:
# ['year_on_year_change'] remove commas, fill NaN values and convert to int
bmore_pop_data['year_on_year_change'] = bmore_pop_data['year_on_year_change'].str.replace(',', '')
bmore_pop_data['year_on_year_change'] = bmore_pop_data['year_on_year_change'].replace('-', np.nan)
bmore_pop_data['year_on_year_change'] = pd.to_numeric(bmore_pop_data['year_on_year_change'], 
                                                      errors='coerce').fillna(0).astype(int)
bmore_pop_data['year_on_year_change']

0         0
1    -10046
2     -8333
3     -6800
4     -9003
5     -4496
6     -2204
7     -1858
8     -2248
9     -1774
10    18958
11     -449
12     2542
13     -444
14     1242
15    -1002
16    -6289
17    -5689
18    -7612
19    -8640
20   -11444
21    -6579
22    -7471
23    -3868
Name: year_on_year_change, dtype: int64
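
With the column converted to integers, the outlier check mentioned earlier can be made concrete. As a rough sketch, assuming Tukey's 1.5 × IQR rule is an acceptable definition of an outlier (the notebook does not name one), the values printed above can be screened like this; the variable names are illustrative.

```python
import pandas as pd

# Year-on-year changes as printed above (2000 is the zero-filled placeholder).
changes = pd.Series(
    [0, -10046, -8333, -6800, -9003, -4496, -2204, -1858, -2248, -1774,
     18958, -449, 2542, -444, 1242, -1002, -6289, -5689, -7612, -8640,
     -11444, -6579, -7471, -3868],
    index=range(2000, 2024),
    name="year_on_year_change",
)

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = changes.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = changes[(changes < lower) | (changes > upper)]
print(outliers)
```

Under this rule only the 2010 increase is flagged; the largest single-year loss (-11,444) stays inside the lower fence.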

## Changes to Percent change column

* Replace non-numerical values and convert the column to numeric values for calculations
* Convert the percentages to floats and then to decimal form

In [57]:
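# Replace the "-" placeholder in the first row with NaN, then convert to numeric.
# Note: the remaining values still end in "%", so to_numeric cannot parse them;
# errors='coerce' turns them into NaN and fillna(0) then sets every row to 0.
bmore_pop_data['change_in_percent'] = bmore_pop_data['change_in_percent'].replace('-', np.nan)
bmore_pop_data['change_in_percent'] = pd.to_numeric(bmore_pop_data['change_in_percent'], 
                                                    errors='coerce').fillna(0).astype(int)
bmore_pop_data['change_in_percent']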

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
Name: change_in_percent, dtype: int64
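
The all-zero output above suggests that the "%"-suffixed strings could not be parsed, so every value was coerced to NaN and then filled with 0. A minimal sketch of a conversion that strips the sign first is shown below. It uses the first few values from the raw table at the top of the notebook, and leaving the first row as NaN instead of 0 is an assumption about what is wanted.

```python
import pandas as pd

# First rows of the column as shown in the raw table at the top of the notebook.
raw = pd.Series(["-", "-1.55%", "-1.30%", "-1.08%"], name="change_in_percent")

# Without stripping "%", every string fails to parse: NaN, then 0 after fillna.
coerced = pd.to_numeric(raw, errors="coerce").fillna(0)

# Strip the trailing "%" first, then convert and scale to decimal form.
# The "-" placeholder still fails to parse and is left as NaN.
stripped = raw.str.rstrip("%")
decimal = pd.to_numeric(stripped, errors="coerce").div(100)

print(coerced.tolist())
print(decimal.round(4).tolist())
```

Keeping the result as a float column also avoids the `astype(int)` step, which would truncate fractional values to 0 even after a successful parse.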

In [58]:
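# Calculate statistics: average yearly change and total loss since 2000
def calculate_statistics(df):
    # Skip the 2000 row: its change is a zero-filled placeholder, not an observed value
    avg_pop_loss = int(df['year_on_year_change'].iloc[1:].mean())
    # Peak minus trough; in this dataset the peak is 2000 and the trough is 2023
    total_pop_loss = int(df['population'].max() - df['population'].min())
    return avg_pop_loss, total_pop_loss

avg_pop_loss, total_pop_loss = calculate_statistics(bmore_pop_data)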
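
How the zero-filled 2000 row is handled affects the average yearly change. The sketch below compares the two options by recomputing the changes from the population figures printed earlier with `diff()`. The variable names are illustrative and this is not the notebook's own function.

```python
import pandas as pd

# Population figures as printed in the cleaned column earlier in the notebook.
population = pd.Series(
    [648746, 638700, 630367, 623567, 614564, 610068, 607864, 606006,
     603758, 601984, 620942, 620493, 623035, 622591, 623833, 622831,
     616542, 610853, 603241, 594601, 583157, 576578, 569107, 565239],
    index=range(2000, 2024),
    name="population",
)

# diff() leaves 2000 as NaN, and mean() skips NaN by default,
# so the average is taken over the 23 observed changes.
changes = population.diff()
avg_change_observed = changes.mean()

# Zero-filling the first row divides the same total by 24 instead.
avg_change_zero_filled = changes.fillna(0).mean()

# Net loss between the first and last year.
net_loss = int(population.iloc[0] - population.iloc[-1])

print(round(avg_change_observed, 2), round(avg_change_zero_filled, 2), net_loss)
```

The two averages differ by about 150 people per year (roughly -3,631 versus -3,479), while the net loss of 83,507 is the same either way.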

## Scatter Plot

In [59]:
# Scatter plot of year vs. population to show the decline in population since 2000
fig = px.scatter(bmore_pop_data, x='year', y='population', title='Baltimore City population by year',
                 labels={'year': 'Year', 'population': 'Population'})
fig.show()

The chart above shows a significant decrease from 2000 to 2023. The population jumped by 18,958 in 2010, stayed roughly flat for the next few years, and then declined again from 2015 onward. The total decrease from 2000 to 2023 is 83,507 people.

## Bar Chart

In [60]:
# Bar chart of the net year-on-year change in population
fig2 = px.bar(bmore_pop_data, x='year', y='year_on_year_change', 
              title='Year-on-year change in Baltimore City population', 
              labels={'year_on_year_change': 'Net change (people)', 'year': 'Year'})

fig2.show()

This chart offers a different perspective: it shows the net population change for each year from 2000 to 2023, making it easy to tell the few years with a net gain apart from the many years with a net loss.
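
The split between gain and loss years in the chart can also be tallied directly. This is a small sketch using the year-on-year values printed earlier; the 2000 row is left out because it has no prior year, and the variable names are illustrative.

```python
import pandas as pd

# Year-on-year changes for 2001-2023 as printed earlier in the notebook.
changes = pd.Series(
    [-10046, -8333, -6800, -9003, -4496, -2204, -1858, -2248, -1774,
     18958, -449, 2542, -444, 1242, -1002, -6289, -5689, -7612, -8640,
     -11444, -6579, -7471, -3868],
    index=range(2001, 2024),
    name="year_on_year_change",
)

# Split the series into years with a net gain and years with a net loss.
gains = changes[changes > 0]
losses = changes[changes < 0]

print("years with a net gain:", list(gains.index))
print("sum of gains:", int(gains.sum()))
print("years with a net loss:", len(losses))
print("sum of losses:", int(losses.sum()))
```

These are net yearly changes, so they do not separate people moving in from people moving out within a given year.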

In [61]:
bmore_pop_data.dtypes

year                   int64
population             int64
year_on_year_change    int64
change_in_percent      int64
dtype: object

In [62]:


bmore_pop_data['year'].dtype

dtype('int64')