# Analysis of Income in Maryland
**by Yuan Shen Tay**

## Introduction

Over the years, the cost of living has been increasing throughout the country, which leads to the question of how much income is needed to sustain a living. The cost of living varies from state to state and from county to county due to differences in housing prices and the cost of basic necessities.  

In this tutorial, I will look at the income for each county across Maryland. I will analyze the income trend for each county and see whether the minimum wage is sufficient for Maryland. Through my analysis, I will also see whether poverty rates are correlated with income. 



## Importing Libraries

Before we start the analysis, we need to import some libraries that contain the tools we will use to carry it out. The libraries used in this project are:  
**requests** - Requests makes it easier to send HTTP requests   
**pandas** - Pandas provides tools for data analysis and manipulation, mainly the DataFrame   
**numpy** - NumPy is a scientific computing library for working with large multidimensional arrays   
**matplotlib** - Matplotlib is a plotting library that lets us plot and visualize our data   
**BeautifulSoup** - BeautifulSoup contains the tools we need to parse HTML data    
**sklearn** - scikit-learn is a machine learning library that provides a large number of models we can use to model our data

In [48]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from sklearn import linear_model

## Data Collection

### Importing Data

The first stage of the data life cycle is collecting data. The datasets used are obtained from the Maryland state open data website. The links for the datasets are:  
https://opendata.maryland.gov/Demographic/Maryland-Per-Capita-Personal-Income-Constant-2012-/q4mi-9fr9  
https://opendata.maryland.gov/Planning/Poverty-Rate-With-Margin-Of-Error-2010-2019/iudf-4y2j  
https://opendata.maryland.gov/Demographic/Maryland-Median-Household-Income-By-Year-With-Marg/bvk4-qsxs  

The website already provides an Application Programming Interface (API), which allows us to connect directly to the website and obtain the CSV files that contain the data. The datasets we are using contain the income per capita, poverty rate, and median household income for each county in Maryland.
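
As a minimal sketch of how the CSV endpoints relate to the dataset links above: each dataset page URL ends in an identifier (for example `q4mi-9fr9`), and the CSV is read from `/resource/<identifier>.csv` on the same host. The helper name `resource_csv_url` is hypothetical and introduced only for illustration.

```python
def resource_csv_url(dataset_id, host="opendata.maryland.gov"):
    """Build the CSV endpoint URL for a dataset identifier (hypothetical helper)."""
    return f"https://{host}/resource/{dataset_id}.csv"

# Identifiers taken from the end of the three dataset links above
dataset_ids = {
    "income_per_capita": "q4mi-9fr9",
    "poverty_rate": "iudf-4y2j",
    "median_income": "bvk4-qsxs",
}

# Map each table name to the URL that can be passed to pd.read_csv
csv_urls = {name: resource_csv_url(ds_id) for name, ds_id in dataset_ids.items()}
```

Each URL in `csv_urls` can then be passed to `pd.read_csv` to load the corresponding table.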

In [49]:
income_per_capita = pd.read_csv('https://opendata.maryland.gov/resource/q4mi-9fr9.csv')
income_per_capita.head()

Unnamed: 0,date_created,year,maryland,allegany_county,anne_arundel_county,baltimore_city,baltimore_county,calvert_county,caroline_county,carroll_county,...,kent_county,montgomery_county,prince_george_s_county,queen_anne_s_county,somerset_county,st_mary_s_county,talbot_county,washington_county,wicomico_county,worcester_county
0,"September 29, 2020",2010,52251,33436,55360,39699,51519,52421,36012,50397,...,47491,74028,42782,51842,26854,49021,59672,38740,36230,46449
1,"September 29, 2020",2011,53432,33891,56884,40923,51530,53383,37370,51574,...,48498,76529,43336,53009,26897,49897,60929,39419,36134,47180
2,"September 29, 2020",2012,53547,33946,57182,40744,51982,53326,38773,51859,...,48590,76901,42842,53617,26830,49300,60548,39822,35419,48977
3,"September 29, 2020",2013,52352,34049,56537,41156,51151,52177,39330,51601,...,48917,72577,42140,53209,27756,48499,60864,39646,35649,48894
4,"September 29, 2020",2014,53170,34808,57551,42857,52331,52948,39790,52960,...,50625,72746,42425,54075,28881,49133,62278,40548,37043,49840


In [50]:
poverty_rate = pd.read_csv('https://opendata.maryland.gov/resource/iudf-4y2j.csv')
poverty_rate.head()

Unnamed: 0,date_created,year,estimate,maryland,allegany_county,anne_arundel_county,baltimore_city,baltimore_county,calvert_county,caroline_county,...,kent_county,montgomery_county,prince_george_s_county,queen_anne_s_county,somerset_county,st_mary_s_county,talbot_county,washington_county,wicomico_county,worcester_county
0,2020-09-29T00:00:00.000,2010,Poverty Rate,9.9,17.1,6.6,24.7,8.2,6.2,13.0,...,14.2,7.5,9.4,7.3,19.3,7.5,9.7,11.4,16.6,10.6
1,2020-09-29T00:00:00.000,2010,MOE,0.3,3.0,1.1,1.8,1.2,1.4,2.8,...,3.1,0.8,1.2,1.7,5.5,1.9,2.2,2.1,2.8,2.7
2,2020-09-29T00:00:00.000,2011,Poverty Rate,10.2,19.1,6.5,24.5,9.6,6.1,13.1,...,13.9,6.7,9.4,8.7,26.2,8.6,10.8,11.8,17.7,13.0
3,2020-09-29T00:00:00.000,2011,MOE,0.3,3.4,1.2,1.7,1.2,1.5,2.7,...,3.4,0.8,1.0,1.7,6.0,1.8,2.2,1.7,2.6,2.6
4,2022-04-08T00:00:00.000,2012,Poverty Rate,10.4,18.1,6.3,24.5,9.7,7.0,15.7,...,14.0,6.6,10.3,8.2,29.6,8.4,9.7,13.7,16.7,11.1


In [51]:
median_income = pd.read_csv('https://opendata.maryland.gov/resource/bvk4-qsxs.csv')
median_income.head()

Unnamed: 0,date_created,year,data,maryland,allegany_county,anne_arundel_county,baltimore_city,baltimore_county,calvert_county,caroline_county,...,kent_county,montgomery_county,prince_george_s_county,queen_anne_s_county,somerset_county,st_mary_s_county,talbot_county,washington_county,wicomico_county,worcester_county
0,"September 29, 2020",2010,Income,68933,37083,80908,38186,62300,86536,55480,...,49017,88559,69524,78503,38134,81559,56806,51610,47702,55492
1,"September 29, 2020",2010,MOE,833,2826,2311,1414,2006,5064,2965,...,4582,2710,1609,5181,2747,5070,3948,3327,3097,3507
2,"September 29, 2020",2011,Income,70075,38504,82980,38478,62309,88406,50809,...,49795,92288,70114,75158,35426,80943,55145,52028,45788,48472
3,"September 29, 2020",2011,MOE,760,2693,3430,1536,1728,4369,4213,...,4603,2758,1911,6363,3426,2717,4929,2928,3582,4653
4,"September 29, 2020",2012,Income,71169,38670,87083,39077,62413,87215,48772,...,49969,94365,69258,79012,34454,85478,61529,52604,50204,55875


### Tidying Up Data

First, we need to check for any missing data in our datasets.

In [52]:
income_per_capita.isna().sum()

date_created              0
year                      0
maryland                  0
allegany_county           0
anne_arundel_county       0
baltimore_city            0
baltimore_county          0
calvert_county            0
caroline_county           0
carroll_county            0
cecil_county              0
charles_county            0
dorchester_county         0
frederick_county          0
garrett_county            0
harford_county            0
howard_county             0
kent_county               0
montgomery_county         0
prince_george_s_county    0
queen_anne_s_county       0
somerset_county           0
st_mary_s_county          0
talbot_county             0
washington_county         0
wicomico_county           0
worcester_county          0
dtype: int64

In [53]:
poverty_rate.isna().sum()

date_created              0
year                      0
estimate                  0
maryland                  0
allegany_county           0
anne_arundel_county       0
baltimore_city            0
baltimore_county          0
calvert_county            0
caroline_county           0
carroll_county            0
cecil_county              0
charles_county            0
dorchester_county         0
frederick_county          0
garrett_county            0
harford_county            0
howard_county             0
kent_county               0
montgomery_county         0
prince_george_s_county    0
queen_anne_s_county       0
somerset_county           0
st_mary_s_county          0
talbot_county             0
washington_county         0
wicomico_county           0
worcester_county          0
dtype: int64

In [54]:
median_income.isna().sum()

date_created              0
year                      0
data                      0
maryland                  0
allegany_county           0
anne_arundel_county       0
baltimore_city            0
baltimore_county          0
calvert_county            0
caroline_county           0
carroll_county            0
cecil_county              0
charles_county            0
dorchester_county         0
frederick_county          0
garrett_county            0
harford_county            0
howard_county             0
kent_county               0
montgomery_county         0
prince_george_s_county    0
queen_anne_s_county       0
somerset_county           0
st_mary_s_county          0
talbot_county             0
washington_county         0
wicomico_county           0
worcester_county          0
dtype: int64

Fortunately, since the count of missing entries is 0 for every column, we have no missing entries in our data. If there were missing entries, we could call the function dropna() to drop them from our data. However, dropping missing entries is not always the right way to handle them. 
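
To illustrate the difference, here is a small sketch on made-up data (the values and the `county_a` column are invented for illustration and are not taken from the Maryland datasets). It compares dropping a row with a missing value against filling the gap by linear interpolation, which is one possible alternative when every year should be kept.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not from the Maryland datasets)
toy = pd.DataFrame({
    "year": [2010, 2011, 2012, 2013],
    "county_a": [100.0, np.nan, 120.0, 130.0],
})

# Option 1: drop every row that has a missing value (the 2011 row is lost)
dropped = toy.dropna()

# Option 2: keep every year and fill the gap by linear interpolation
filled = toy.copy()
filled["county_a"] = filled["county_a"].interpolate()
```

With `dropna()` the year 2011 disappears from the table, while interpolation keeps it and estimates its value from the neighbouring years. Which option is appropriate depends on the data and the analysis.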

Next, we will drop the rows and columns that are not used. For the rows, we will not use the rows marked MOE in the poverty rate and median income tables, as they hold the margin of error. For the columns, we only need the year and the value for each county, so we will drop all the other columns.
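
Since the poverty rate and median income tables need the same treatment, the row filter and column drop could be wrapped in one function. This is only a sketch: the helper name `keep_rows_with_label` and the toy values are hypothetical, and the toy table merely mimics the layout of the poverty rate data.

```python
import pandas as pd

def keep_rows_with_label(df, label_col, label, extra_drop=("date_created",)):
    """Keep rows where label_col equals label, then drop the label and extra columns."""
    kept = df.loc[df[label_col] == label]
    kept = kept.drop(columns=[label_col, *extra_drop])
    # Renumber the rows from 0 after filtering
    return kept.reset_index(drop=True)

# Toy table with the same layout as the poverty rate data (values made up)
toy = pd.DataFrame({
    "date_created": ["x", "x", "x", "x"],
    "year": [2010, 2010, 2011, 2011],
    "estimate": ["Poverty Rate", "MOE", "Poverty Rate", "MOE"],
    "county_a": [10.0, 0.5, 11.0, 0.4],
})

tidy = keep_rows_with_label(toy, "estimate", "Poverty Rate")
```

Under these assumptions, the same call with `"data"` and `"Income"` as the label column and label would apply to the median income table.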

In [55]:
# dropping data from the income per capita table
income_per_capita = income_per_capita.drop(columns=['date_created'])
income_per_capita.head()

Unnamed: 0,year,maryland,allegany_county,anne_arundel_county,baltimore_city,baltimore_county,calvert_county,caroline_county,carroll_county,cecil_county,...,kent_county,montgomery_county,prince_george_s_county,queen_anne_s_county,somerset_county,st_mary_s_county,talbot_county,washington_county,wicomico_county,worcester_county
0,2010,52251,33436,55360,39699,51519,52421,36012,50397,39607,...,47491,74028,42782,51842,26854,49021,59672,38740,36230,46449
1,2011,53432,33891,56884,40923,51530,53383,37370,51574,40235,...,48498,76529,43336,53009,26897,49897,60929,39419,36134,47180
2,2012,53547,33946,57182,40744,51982,53326,38773,51859,40299,...,48590,76901,42842,53617,26830,49300,60548,39822,35419,48977
3,2013,52352,34049,56537,41156,51151,52177,39330,51601,40262,...,48917,72577,42140,53209,27756,48499,60864,39646,35649,48894
4,2014,53170,34808,57551,42857,52331,52948,39790,52960,40944,...,50625,72746,42425,54075,28881,49133,62278,40548,37043,49840


In [56]:
# dropping data from the poverty rate table
poverty_rate = poverty_rate.loc[poverty_rate['estimate'] == 'Poverty Rate']
poverty_rate = poverty_rate.drop(columns=['date_created', 'estimate'])
poverty_rate.reset_index(drop=True, inplace=True)
poverty_rate.head()

Unnamed: 0,year,maryland,allegany_county,anne_arundel_county,baltimore_city,baltimore_county,calvert_county,caroline_county,carroll_county,cecil_county,...,kent_county,montgomery_county,prince_george_s_county,queen_anne_s_county,somerset_county,st_mary_s_county,talbot_county,washington_county,wicomico_county,worcester_county
0,2010,9.9,17.1,6.6,24.7,8.2,6.2,13.0,5.4,10.5,...,14.2,7.5,9.4,7.3,19.3,7.5,9.7,11.4,16.6,10.6
1,2011,10.2,19.1,6.5,24.5,9.6,6.1,13.1,5.5,9.7,...,13.9,6.7,9.4,8.7,26.2,8.6,10.8,11.8,17.7,13.0
2,2012,10.4,18.1,6.3,24.5,9.7,7.0,15.7,6.3,11.9,...,14.0,6.6,10.3,8.2,29.6,8.4,9.7,13.7,16.7,11.1
3,2013,10.2,18.6,7.3,22.7,9.5,6.9,16.7,6.8,9.8,...,14.9,7.0,9.9,8.4,28.5,8.2,10.9,12.0,16.5,13.1
4,2014,10.4,18.5,6.7,23.3,9.8,7.2,16.0,5.9,10.6,...,13.8,7.2,10.3,7.5,25.5,8.6,11.7,13.8,16.9,11.9


In [57]:
# dropping data from the median income table
median_income = median_income.loc[median_income['data'] == 'Income']
median_income = median_income.drop(columns=['date_created', 'data'])
median_income.reset_index(drop=True, inplace=True)
median_income.head()

Unnamed: 0,year,maryland,allegany_county,anne_arundel_county,baltimore_city,baltimore_county,calvert_county,caroline_county,carroll_county,cecil_county,...,kent_county,montgomery_county,prince_george_s_county,queen_anne_s_county,somerset_county,st_mary_s_county,talbot_county,washington_county,wicomico_county,worcester_county
0,2010,68933,37083,80908,38186,62300,86536,55480,80291,61506,...,49017,88559,69524,78503,38134,81559,56806,51610,47702,55492
1,2011,70075,38504,82980,38478,62309,88406,50809,82553,61191,...,49795,92288,70114,75158,35426,80943,55145,52028,45788,48472
2,2012,71169,38670,87083,39077,62413,87215,48772,79304,62443,...,49969,94365,69258,79012,34454,85478,61529,52604,50204,55875
3,2013,72482,39994,85685,41988,64624,91993,46015,82073,64880,...,55695,97873,71682,80143,36106,78274,57525,55643,47536,52276
4,2014,73851,39808,86654,41895,67766,92446,49573,84500,62198,...,53288,97279,71904,80650,38376,84686,54836,54606,51927,55691


## Exploratory Data Analysis And Data Visualization

Now that we have collected all our data, we are ready to start analyzing and visualizing it, which is the next step in the data science pipeline. 

### Analysis of Income Trend

To analyze the income trend, I will need first check to see

### Analysis of Poverty Rate