# Analysis of Income in Maryland
**by Yuan Shen Tay**

## Introduction

Over the years, the cost of living has been increasing throughout the country, which leads to the question of how much income is needed to sustain a household. The cost of living varies from state to state and even from county to county because of differences in housing prices and the cost of basic necessities.  

In this tutorial, I will be looking at the income for each county across Maryland. I will analyze the income trend for each county and predict the income for Maryland as a whole. Through my analysis, I will also see if there is any correlation between poverty rates and income. 



## Importing Libraries

Before we start the analysis, we need to import some libraries that contain the tools we need to carry out the analysis. The libraries used in this project are:  
**pandas** - Pandas has the tools needed for data analysis and manipulation, mainly the DataFrame   
**numpy** - Numpy is a scientific computing library that we can use on large multidimensional arrays   
**matplotlib** - Matplotlib is a plotting library that we use to plot and visualize our data    
**sklearn** - SciKit Learn is a machine learning library that provides a large number of models which we can use to classify our data or fit regressions  
**statsmodels** - statsmodels contains functions that can be used to estimate statistical models and conduct statistical tests

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols

## Data Collection

### Importing Data

The first stage of the data life cycle is collecting data. The datasets used are obtained from the Maryland state open data website. The links for the datasets are:  
https://opendata.maryland.gov/Demographic/Maryland-Per-Capita-Personal-Income-Constant-2012-/q4mi-9fr9  
https://opendata.maryland.gov/Planning/Poverty-Rate-With-Margin-Of-Error-2010-2019/iudf-4y2j  
https://opendata.maryland.gov/Demographic/Maryland-Median-Household-Income-By-Year-With-Marg/bvk4-qsxs  

The website already provides an Application Programming Interface (API), which allows us to connect directly to the website and obtain the csv files that contain the data. The datasets we are using contain the income per capita, poverty rate and median household income for each county in Maryland.

In [None]:
income_per_capita = pd.read_csv('https://opendata.maryland.gov/resource/q4mi-9fr9.csv')
income_per_capita.head()

In [None]:
poverty_rate = pd.read_csv('https://opendata.maryland.gov/resource/iudf-4y2j.csv')
poverty_rate.head()

In [None]:
median_income = pd.read_csv('https://opendata.maryland.gov/resource/bvk4-qsxs.csv')
median_income.head()

## Data Management

Now that we have collected our data, the next step is to tidy it up, which means filtering out everything that is not used in our analysis and handling missing entries in our datasets.

### Missing Entries

We need to check for any missing data in our datasets.

In [None]:
income_per_capita.isna().sum()

In [None]:
poverty_rate.isna().sum()

In [None]:
median_income.isna().sum()

Fortunately, since the sum of missing entries is 0 for every column, we have no missing entries in our data. If there were missing entries, we could call the function dropna() to drop all rows with missing entries from our data. However, dropping is not always the right way to handle missing entries, since it discards the other values in those rows as well. 
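
As a minimal sketch of the two options, assuming a small made-up table (the `toy` frame and its values are hypothetical and not taken from the Maryland datasets), the following compares dropping a row with filling the gap by linear interpolation:

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing value; not from the Maryland datasets
toy = pd.DataFrame(
    {"year": [2010, 2011, 2012, 2013],
     "income": [50000.0, np.nan, 54000.0, 56000.0]}
).set_index("year")

# option 1: drop every row that contains a missing value
dropped = toy.dropna()

# option 2: keep the row and estimate the gap from its neighbours
interpolated = toy.interpolate(method="linear")

print(toy.isna().sum())
print(dropped)
print(interpolated)
```

Interpolation keeps the year 2011 in the table, which matters for a time series where every year is expected to be present. Whether an estimated value is acceptable depends on the analysis.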

### Dropping Unused Data

Next, we will drop the rows and columns that are not used. For the rows, we will not be using the rows marked MOE in the poverty rate and median income tables, as they hold the margin of error rather than the estimate itself. For the columns, we only need the year and the value for each county, so we will drop all the other columns and set the year as the index.

In [None]:
# dropping data from the income per capita table
income_per_capita = income_per_capita.drop(columns=['date_created'])
income_per_capita = income_per_capita.set_index('year')
income_per_capita.head()

In [None]:
# dropping data from the poverty rate table
poverty_rate = poverty_rate.loc[poverty_rate['estimate'] == 'Poverty Rate']
poverty_rate = poverty_rate.drop(columns=['date_created', 'estimate'])
poverty_rate = poverty_rate.set_index('year')
poverty_rate.head()

In [None]:
# dropping data from the median income table
median_income = median_income.loc[median_income['data'] == 'Income']
median_income = median_income.drop(columns=['date_created', 'data'])
median_income = median_income.set_index('year')
median_income.head()

### Combining Data

Lastly, I will combine all the county data into a single table indexed by a MultiIndex made up of the years and the counties.

In [None]:
# combining all county data

# getting the years that we are analyzing
years = poverty_rate.index

# getting all the counties in Maryland
counties = income_per_capita.columns
counties = counties[1:]

all_data = pd.DataFrame()
index = [years, counties]
index = pd.MultiIndex.from_product(index, names = ['years', 'county'])
per_capita = []
median = []
poverty = []
for year in years:
    per_capita.extend(income_per_capita.loc[year][1:].values)
    median.extend(median_income.loc[year][1:].values)
    poverty.extend(poverty_rate.loc[year][1:].values)
all_data['income_per_capita'] = per_capita
all_data['median_income'] = median
all_data['poverty_rate'] = poverty
all_data = all_data.set_index(index)
all_data.head()
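
The loop above appends values year by year because `pd.MultiIndex.from_product` orders its entries with the first level varying slowest. A minimal sketch of that ordering, using hypothetical labels and values (`toy_years`, `toy_counties` and `toy_values` are made up for illustration):

```python
import pandas as pd

# hypothetical labels for illustration only
toy_years = [2010, 2011]
toy_counties = ["county_a", "county_b", "county_c"]

# every (year, county) pair, with all counties of 2010 first, then 2011
toy_index = pd.MultiIndex.from_product(
    [toy_years, toy_counties], names=["years", "county"]
)

# values must be listed year by year to line up with the index
toy_values = [1, 2, 3, 4, 5, 6]
toy_data = pd.DataFrame({"value": toy_values}, index=toy_index)

print(toy_data)

# selecting one county across all years
print(toy_data.xs("county_b", level="county"))
```

If the values were appended county by county instead, they would be attached to the wrong index entries without raising any error, so the order of the loop matters.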

## Exploratory Data Analysis And Data Visualization

Now that we have tidied up all our data, we are ready to start analyzing and visualizing it, which is the next step in the data science pipeline. 

### Analysis of Income Trend

To analyze the income trend, I will be looking at the median household income and income per capita for each county from 2010 to 2019. First, though, we want to see the income trend in Maryland as a whole.

In [None]:
# extracting the income per capita and median income for maryland
maryland_per_capita = income_per_capita['maryland']
maryland_median = median_income['maryland']

# setting the size of the graph
plt.figure(figsize=(10,10))
           
# plotting the income graph
plt.plot(years, maryland_per_capita, label="Income Per Capita")
plt.plot(years, maryland_median, label="Median Income")
plt.title('Income Graph of Maryland from 2010 to 2019')
plt.xlabel('Year')
plt.ylabel('Income')
plt.legend()
plt.show()

As we can see, both the income per capita and the median household income in Maryland show an increasing trend. We can also see that the median household income is much higher than the income per capita. This makes sense: a household can pool the income of several earners, whereas income per capita divides the total income across the whole population, including people who do not earn an income. 

Now, we want to visualize the income trend for each county in Maryland to see if all counties follow the same trend.

In [None]:
# plotting the graph
plt.figure(figsize=(15,10))
for county in counties:
    median = all_data.groupby(['county']).get_group(county)['median_income']
    plt.plot(years, median, marker = 'o', label=county)
plt.title('Median Income Graph of Counties from 2010 to 2019')
plt.xlabel('Year')
plt.ylabel('Income')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.1), ncol = 5)
plt.show() 

Based on the median income graph, we can see that not all counties follow the same trend across the years. In 2019, Baltimore City, Somerset County, St. Mary's County and Washington County show a decrease. Despite that, all counties have a net increase in median income compared to 2010.

In [None]:
plt.figure(figsize=(15,10))
for county in counties:
    per_capita = all_data.groupby(['county']).get_group(county)['income_per_capita']
    plt.plot(years, per_capita, marker = 'o', label=county)
plt.title('Income Per Capita Graph of Counties from 2010 to 2019')
plt.xlabel('Year')
plt.ylabel('Income')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.1), ncol = 5)
plt.show() 

From the income per capita graph of the counties, all counties show some sort of increasing trend but at different magnitudes. Because of these differences, we might want to visualize the distribution of income across the counties in each year to see if the different trends and magnitudes have any impact on the income distribution in Maryland.

In [None]:
# plotting a graph for each year
for year in years:
    # extracting the income per capita and median income for the year
    per_capita = all_data.groupby(['years']).get_group(year)['income_per_capita']
    median = all_data.groupby(['years']).get_group(year)['median_income']

    # setting the size of the graph
    plt.figure(figsize=(10,10))

    # plotting the income graph
    plt.plot(counties, per_capita, label="Income Per Capita")
    plt.plot(counties, median, label="Median Income")
    plt.title('Income Graph in ' + str(year))
    plt.xlabel('County')
    plt.xticks(rotation = 90)
    plt.ylabel('Income')
    plt.legend()
    plt.show() 

Looking at the counties side by side, there is not much change in the shape of the graph over the years for either the median income or the income per capita. This shows that despite the increase in income over time, the distribution of income across the counties has stayed largely the same. It also shows that the differences in the trends of each county did not have much impact on the distribution of income across counties. 

Hence, we can still say that income shows an overall increasing trend in all counties in Maryland over the years.

### Analysis of Poverty Rate

First, we would want to look at the trend of poverty rate across years for Maryland.

In [None]:
# extracting the maryland poverty rate
maryland_poverty_rate = poverty_rate['maryland']

# setting the size of the graph
plt.figure(figsize=(10,10))

# plotting the poverty rate graph
plt.plot(years, maryland_poverty_rate)
plt.title('Graph of Maryland Poverty Rate from 2010 to 2019')
plt.xlabel('Year')
plt.ylabel('Poverty Rate')
plt.show()  

The graph shows a decreasing trend across the years, which suggests that the increase in income may be related to the decrease in poverty rates. Next, we want to look at the poverty rates across the counties in 2019.

In [None]:
# setting the size of the graph
plt.figure(figsize=(10,10))

# plotting the poverty rate graph
plt.plot(counties, poverty_rate.loc[years[-1]][1:])
plt.title('Graph of Poverty Rate by County')
plt.xlabel('County')
plt.xticks(rotation=90)
plt.ylabel('Poverty Rate')
plt.show()

The poverty rates seem to be correlated with income, as the counties with higher incomes tend to have lower poverty rates.

## Hypothesis Testing

The next phase in the data science pipeline is to apply modeling techniques such as linear regression, decision trees and k-nearest neighbors to obtain a predictive model of our data. Using the predictive model, we can carry out hypothesis testing. 

### Predicting Maryland State Income

In this part, we will fit a linear regression to our data and use the equation of the regression line to predict future income values. A linear model has the equation
<center>$Y = mX + c$</center>
where $m$ is the slope and $c$ is the intercept.

In [None]:
# using the years as x and the Maryland median income as y
x = maryland_median.index
y = maryland_median

m, c = np.polyfit(x,y, deg=1)

plt.figure(figsize=(10,10))
plt.plot(x, y, 'o', x, m*x + c)
plt.xlabel('Year')
plt.ylabel('Median Income')
plt.title('Predicted Graph of Median Income')
plt.show()

In [None]:
print('The slope, m is ' + str(m) +' and the intercept, c is ' + str(c))

Therefore, our equation is:
<center>$Y = 1933.12X - 3818109.27$</center>

In [None]:
m*2022+c

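Using this equation, our predicted median income for this year, 2022, is $\$$90,649.56. Since the model was fitted on data from 2010 to 2019, this is an extrapolation that assumes the linear trend continues.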
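
To make the steps of fitting and predicting concrete, here is a minimal sketch on made-up points that lie exactly on the line $Y = 2X + 5$. The names `toy_x`, `toy_y` and `predict` are hypothetical and the numbers are not the Maryland data:

```python
import numpy as np

# hypothetical points lying exactly on y = 2x + 5; not the Maryland data
toy_x = np.array([0.0, 1.0, 2.0, 3.0])
toy_y = 2.0 * toy_x + 5.0

# with deg=1, polyfit returns the slope first and then the intercept
toy_m, toy_c = np.polyfit(toy_x, toy_y, deg=1)

def predict(x_new, slope, intercept):
    """Evaluate the fitted line at x_new."""
    return slope * x_new + intercept

print(toy_m, toy_c)
print(predict(10.0, toy_m, toy_c))
```

Because the toy points have no noise, the fit recovers the slope and intercept almost exactly. With real data the fitted line is only an approximation, and predictions far outside the fitted range become less reliable.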

### Poverty Rates and Income Correlation

I will be exploring the relationship between poverty rates and income. In order to see if there is a relationship between them, I will be using a linear regression.  
**Null Hypothesis, $H_0$**: There is no relationship between poverty rates and income  
**Alternative Hypothesis, $H_1$**: There is a relationship between poverty rates and income  


In [None]:
# creating our model
per_capita_model = linear_model.LinearRegression()
median_model = linear_model.LinearRegression()

x1 = np.array(all_data['income_per_capita']).reshape(len(all_data['income_per_capita']), 1)
x2 = np.array(all_data['median_income']).reshape(len(all_data['median_income']), 1)
y = np.array(all_data['poverty_rate']).reshape(len(all_data['poverty_rate']), 1)

# fitting the data into our model
per_capita_model.fit(x1, y)
median_model.fit(x2, y)

Now that we have created our linear models and trained them using the fit() function, we want to visualize the predictions and look at the results.

In [None]:
plt.figure(figsize=(10,10))
plt.plot(all_data['income_per_capita'], all_data['poverty_rate'], 'o')

# using the model to predict the values
predicted = per_capita_model.predict(x1)
plt.plot(all_data['income_per_capita'], predicted)

plt.xlabel('Income per capita')
plt.ylabel('Poverty Rate')
plt.title('Poverty Rate vs Income Per Capita')
plt.show()

# poverty rate is the response, matching the plot and the sklearn model above
results = ols(formula = 'poverty_rate ~ income_per_capita', data = all_data).fit()
results.summary()

In [None]:
# plotting the model and getting the results
plt.figure(figsize=(10,10))
plt.plot(all_data['median_income'], all_data['poverty_rate'], 'o')

# using the model to predict the values
predicted = median_model.predict(x2)
plt.plot(all_data['median_income'], predicted)

plt.xlabel('Median Income')
plt.ylabel('Poverty Rate')
plt.title('Poverty Rate vs Median Income')
plt.show()

# poverty rate is the response, matching the plot and the sklearn model above
results = ols(formula = 'poverty_rate ~ median_income', data = all_data).fit()
results.summary()

To understand how well the fitted line matches our data, we can use the coefficient of determination, $R^2$, which tells us on a scale of 0 to 1 how much of the variation in the data the regression explains. The $R^2$ score for poverty rate against income per capita is around $0.523$ and the score for poverty rate against median income is around $0.745$. In other words, median income explains roughly three quarters of the variation in poverty rate, which is a reasonably good fit, while income per capita explains only about half. 
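
As a worked illustration of what $R^2$ measures, the sketch below computes it by hand as $1 - SS_{res}/SS_{tot}$ for a line fitted to four made-up points (all names and values here are hypothetical):

```python
import numpy as np

# hypothetical data for illustration only
toy_x = np.array([0.0, 1.0, 2.0, 3.0])
toy_y = np.array([1.0, 3.0, 2.0, 4.0])

# fit a straight line and evaluate it at the observed x values
slope, intercept = np.polyfit(toy_x, toy_y, deg=1)
fitted = slope * toy_x + intercept

# residual sum of squares: variation the line does not explain
ss_res = np.sum((toy_y - fitted) ** 2)

# total sum of squares: variation around the mean of y
ss_tot = np.sum((toy_y - toy_y.mean()) ** 2)

r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```

For a simple linear regression with one predictor, this value equals the square of the correlation coefficient between the two variables.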

Next, we have to look at the p-value, which tells us how likely it would be to observe a relationship at least this strong if the null hypothesis were true. Typically a significance level of $0.05$ is used, and if the p-value found is less than that, we reject the null hypothesis. The p-value reported for both our models rounds to $0$, which means that it is nigh impossible to observe data like ours if there were no relationship between income and poverty rate. Hence, we reject our null hypothesis in favor of the alternative hypothesis: there is statistically significant evidence of a relationship between income and poverty rates. Based on the graphs, we can see that the relationship is negative, with higher income going together with lower poverty rates. Note that this establishes an association and does not by itself show that one causes the other.
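
The summary table rounds p-values for display, so it can help to read the values directly from the fitted results object. A minimal sketch on synthetic data follows; the `toy` frame, its column names and the numbers used to generate it are assumptions made for illustration and are unrelated to the Maryland results:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# hypothetical data with a clear negative relationship plus random noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"income": np.linspace(40000, 100000, 50)})
toy["poverty"] = 20.0 - 0.00015 * toy["income"] + rng.normal(0, 0.5, size=50)

toy_results = ols(formula="poverty ~ income", data=toy).fit()

# coefficient of determination of the fit
print(toy_results.rsquared)

# unrounded p-value and estimated slope for the income term
print(toy_results.pvalues["income"])
print(toy_results.params["income"])
```

Comparing `pvalues` against the chosen significance level is the same decision rule described above, just applied to the unrounded number.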

## Insights Attained

As a result of our analysis, we found that income in Maryland has been increasing over time. Within that overall increase, each county follows a different trend, with income rising or falling from year to year but ending the period higher than it started. However, these differences in trend do not change the income distribution in Maryland: counties with higher incomes remain the higher-income counties, and counties with lower incomes remain the lower-income counties. 

Next, we concluded that there is a correlation between income and poverty rate, where the higher the income, the lower the poverty rate of the county. Since the distribution of income across the counties in Maryland stays the same, we would expect the distribution of poverty rates across the counties to remain largely unchanged as well. 

In the future, I would like to extend my findings by examining whether the increase in income can keep up with the increase in living expenses over the years. I would also like to extend the analysis to the whole US rather than limiting myself to Maryland.