#### 3. Subsetting & Merging

### This script contains the following:
1. Importing libraries and data
2. Subsetting dataframes
3. Merging dataframes
4. Exporting merged dataframe

### 1. Importing libraries and data

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
path = r'C:\Users\Neena Tilton\Dropbox\Projects\MinWage_Crime'

In [3]:
# Import dataframe for Min Wage

df_mw = pd.read_pickle(os.path.join(path, '02_Data', 'PreparedData', 'MinWage_wrangled.pkl'))

In [4]:
# Import dataframe for Crime

df_crime = pd.read_pickle(os.path.join(path, '02_Data', 'PreparedData', 'crimes_wrangled.pkl'))

In [5]:
df_mw.head()

Unnamed: 0,Year,State,state_mw,state_mw_2020,fed_mw,fed_mw_2020,effective_mw,effective_mw_2020
0,1968,Alabama,0.0,0.0,1.15,8.55,1.15,8.55
1,1968,Alaska,2.1,15.61,1.15,8.55,2.1,15.61
2,1968,Arizona,0.468,3.48,1.15,8.55,1.15,8.55
3,1968,Arkansas,0.15625,1.16,1.15,8.55,1.15,8.55
4,1968,California,1.65,12.26,1.15,8.55,1.65,12.26


In [6]:
df_crime.head()

Unnamed: 0,state,year,prisoner_count,state_population,violent_crime,murder,robbery,burglary
0,FEDERAL,2001,149852.0,,,,,
1,ALABAMA,2001,24741.0,4468912.0,19582.0,379.0,5584.0,40642.0
2,ALASKA,2001,4570.0,633630.0,3735.0,39.0,514.0,3847.0
3,ARIZONA,2001,27710.0,5306966.0,28675.0,400.0,8868.0,54821.0
4,ARKANSAS,2001,11489.0,2694698.0,12190.0,148.0,2181.0,22196.0


### 2. Subsetting dataframes

After some consideration, the rows where 'state' is 'FEDERAL' are not needed in the Crimes df. Create a subset that removes all 'FEDERAL' rows.

In [9]:
df_crime_2 = df_crime.loc[df_crime['state'] != 'FEDERAL']

In [10]:
df_crime_2['state'].value_counts(dropna = False)

ALABAMA           16
PENNSYLVANIA      16
NEVADA            16
NEW HAMPSHIRE     16
NEW JERSEY        16
NEW MEXICO        16
NEW YORK          16
NORTH CAROLINA    16
NORTH DAKOTA      16
OHIO              16
OKLAHOMA          16
OREGON            16
RHODE ISLAND      16
MONTANA           16
SOUTH CAROLINA    16
SOUTH DAKOTA      16
TENNESSEE         16
TEXAS             16
UTAH              16
VERMONT           16
VIRGINIA          16
WASHINGTON        16
WEST VIRGINIA     16
WISCONSIN         16
NEBRASKA          16
MISSOURI          16
ALASKA            16
MISSISSIPPI       16
ARIZONA           16
ARKANSAS          16
CALIFORNIA        16
COLORADO          16
CONNECTICUT       16
DELAWARE          16
FLORIDA           16
GEORGIA           16
HAWAII            16
IDAHO             16
ILLINOIS          16
INDIANA           16
IOWA              16
KANSAS            16
KENTUCKY          16
LOUISIANA         16
MAINE             16
MARYLAND          16
MASSACHUSETTS     16
MICHIGAN     

In [11]:
df_crime_2.isnull().sum()

state               0
year                0
prisoner_count      0
state_population    0
violent_crime       0
murder              0
robbery             0
burglary            0
dtype: int64

In the Min Wage dataframe, 'Year' ranges from 1968 to 2020, while the Crime dataframe covers only 2001 to 2016. Create a subset of the Min Wage dataframe limited to 2001 through 2016.

In [14]:
df_mw_2 = df_mw.loc[(df_mw['Year'] >= 2001) & (df_mw['Year'] <= 2016)]

In [15]:
df_mw_2['Year'].value_counts(dropna = False)

2001    54
2002    54
2003    54
2004    54
2005    54
2006    54
2007    54
2008    54
2009    54
2010    54
2011    54
2012    54
2013    54
2014    54
2015    54
2016    54
Name: Year, dtype: int64
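
The same year filter can also be written with `Series.between`, which is inclusive on both ends by default. A minimal sketch on a toy dataframe (`toy_mw` and its values are illustrative stand-ins, not the real Min Wage data):

```python
import pandas as pd

# Toy stand-in for the Min Wage dataframe (illustrative values only).
toy_mw = pd.DataFrame({
    'Year': [1999, 2001, 2010, 2016, 2020],
    'State': ['Alabama'] * 5,
})

# between() includes both endpoints by default, so this keeps
# 2001 through 2016, the same rows as the >= / <= filter.
toy_mw_sub = toy_mw.loc[toy_mw['Year'].between(2001, 2016)].copy()

print(toy_mw_sub['Year'].tolist())  # [2001, 2010, 2016]
```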

### 3. Merging dataframes

The 'state' names in the Crimes dataframe are in all caps. Convert them to title case so they match the 'State' values in the Min Wage dataframe and the two dataframes can be merged.

In [16]:
df_crime_2['state'] = df_crime_2['state'].str.title()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_crime_2['state'] = df_crime_2['state'].str.title()
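
The warning above appears because the subset was created by slicing another dataframe and then modified. One common way to avoid it, in pandas versions that emit this warning, is to take an explicit `.copy()` when subsetting. A minimal sketch on toy data (`toy_crime` and `toy_crime_2` are hypothetical names, not the real Crimes df):

```python
import pandas as pd

# Toy stand-in for the Crimes dataframe (illustrative values only).
toy_crime = pd.DataFrame({
    'state': ['FEDERAL', 'ALABAMA', 'ALASKA'],
    'year': [2001, 2001, 2001],
})

# .copy() makes the subset an independent dataframe, so the
# column assignment below modifies it directly.
toy_crime_2 = toy_crime.loc[toy_crime['state'] != 'FEDERAL'].copy()
toy_crime_2['state'] = toy_crime_2['state'].str.title()

# rename() returns a new dataframe; reassigning is an alternative to inplace = True.
toy_crime_2 = toy_crime_2.rename(columns={'state': 'State', 'year': 'Year'})

print(toy_crime_2)
```

The original `toy_crime` is left untouched, which is usually what is wanted when keeping the raw import around for reference.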


In [17]:
df_crime_2['state'].value_counts(dropna = False)

Alabama           16
Pennsylvania      16
Nevada            16
New Hampshire     16
New Jersey        16
New Mexico        16
New York          16
North Carolina    16
North Dakota      16
Ohio              16
Oklahoma          16
Oregon            16
Rhode Island      16
Alaska            16
South Carolina    16
South Dakota      16
Tennessee         16
Texas             16
Utah              16
Vermont           16
Virginia          16
Washington        16
West Virginia     16
Wisconsin         16
Nebraska          16
Montana           16
Missouri          16
Mississippi       16
Arizona           16
Arkansas          16
California        16
Colorado          16
Connecticut       16
Delaware          16
Florida           16
Georgia           16
Hawaii            16
Idaho             16
Illinois          16
Indiana           16
Iowa              16
Kansas            16
Kentucky          16
Louisiana         16
Maine             16
Maryland          16
Massachusetts     16
Michigan     
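
Before merging, it can help to compare the state names in both dataframes directly, since `str.title()` also capitalizes small words (for example, 'DISTRICT OF COLUMBIA' becomes 'District Of Columbia'). A minimal sketch on toy data; the names below are illustrative, and whether the real data contains such a name is not shown in the output above:

```python
import pandas as pd

# Toy stand-ins for the two dataframes (illustrative values only).
toy_mw = pd.DataFrame({'State': ['Alabama', 'District of Columbia']})
toy_crime = pd.DataFrame({'state': ['ALABAMA', 'DISTRICT OF COLUMBIA']})

toy_crime['state'] = toy_crime['state'].str.title()

mw_states = set(toy_mw['State'])
crime_states = set(toy_crime['state'])

# Names that appear on only one side would not match in a merge.
only_in_mw = sorted(mw_states - crime_states)
only_in_crime = sorted(crime_states - mw_states)

print(only_in_mw)     # ['District of Columbia']
print(only_in_crime)  # ['District Of Columbia']
```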

Now merge the Crimes df with the Min Wage df.

In [22]:
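# Before we do so, rename the Crimes columns to match the Min Wage column names.
# (The mapping is named col_map so it does not shadow the built-in dict.)

col_map = {'state':'State', 'year':'Year'}

In [23]:
df_crime_2.rename(columns = col_map, inplace = True)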

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


In [24]:
df_crime_2.head()

Unnamed: 0,State,Year,prisoner_count,state_population,violent_crime,murder,robbery,burglary
1,Alabama,2001,24741.0,4468912.0,19582.0,379.0,5584.0,40642.0
2,Alaska,2001,4570.0,633630.0,3735.0,39.0,514.0,3847.0
3,Arizona,2001,27710.0,5306966.0,28675.0,400.0,8868.0,54821.0
4,Arkansas,2001,11489.0,2694698.0,12190.0,148.0,2181.0,22196.0
5,California,2001,157142.0,34600464.0,212867.0,2206.0,64614.0,232273.0


In [27]:
df_merged = df_mw_2.merge(df_crime_2, on = ['State','Year'], how = 'inner', indicator = True)

In [28]:
df_merged.head()

Unnamed: 0,Year,State,state_mw,state_mw_2020,fed_mw,fed_mw_2020,effective_mw,effective_mw_2020,prisoner_count,state_population,violent_crime,murder,robbery,burglary,_merge
0,2001,Alabama,0.0,0.0,5.15,7.52,5.15,7.52,24741.0,4468912.0,19582.0,379.0,5584.0,40642.0,both
1,2001,Alaska,5.65,8.25,5.15,7.52,5.65,8.25,4570.0,633630.0,3735.0,39.0,514.0,3847.0,both
2,2001,Arizona,0.0,0.0,5.15,7.52,5.15,7.52,27710.0,5306966.0,28675.0,400.0,8868.0,54821.0,both
3,2001,Arkansas,5.15,7.52,5.15,7.52,5.15,7.52,11489.0,2694698.0,12190.0,148.0,2181.0,22196.0,both
4,2001,California,6.25,9.13,5.15,7.52,6.25,9.13,157142.0,34600464.0,212867.0,2206.0,64614.0,232273.0,both


In [29]:
df_merged['_merge'].value_counts()

both          800
left_only       0
right_only      0
Name: _merge, dtype: int64

In [30]:
df_merged.shape

(800, 15)
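
Note that with `how = 'inner'` the indicator can only ever report 'both', because unmatched keys are dropped before they can be flagged. Here df_mw_2 has 54 rows per year, while the merged result has 800 / 16 = 50 rows per year, so some Min Wage rows had no match in the Crimes df. To see which keys fall out, one option is to run the same merge with `how = 'outer'`. A minimal sketch on toy data (all names and values below are illustrative; `validate = 'one_to_one'` is an optional extra check that each State/Year pair is unique on both sides):

```python
import pandas as pd

# Toy stand-ins for df_mw_2 and df_crime_2 (illustrative values only).
toy_mw = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Guam'],
    'Year': [2001, 2001, 2001],
    'effective_mw': [5.15, 5.65, 5.15],
})
toy_crime = pd.DataFrame({
    'State': ['Alabama', 'Alaska'],
    'Year': [2001, 2001],
    'murder': [379.0, 39.0],
})

# An outer join keeps unmatched rows, so the indicator can flag them.
toy_check = toy_mw.merge(
    toy_crime,
    on=['State', 'Year'],
    how='outer',
    indicator=True,
    validate='one_to_one',
)

counts = toy_check['_merge'].value_counts()
unmatched = toy_check.loc[toy_check['_merge'] != 'both', ['State', 'Year', '_merge']]

print(counts)
print(unmatched)
```

If the unmatched rows are ones that should be excluded anyway, the inner merge above remains the right final step.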

In [31]:
# Every merged row is flagged 'both' (an inner join keeps only matching keys), so the '_merge' flag column is no longer needed. Remove it.

df_merged = df_merged.drop(columns = ['_merge'])

In [32]:
df_merged.shape

(800, 14)

In [33]:
df_merged.columns

Index(['Year', 'State', 'state_mw', 'state_mw_2020', 'fed_mw', 'fed_mw_2020',
       'effective_mw', 'effective_mw_2020', 'prisoner_count',
       'state_population', 'violent_crime', 'murder', 'robbery', 'burglary'],
      dtype='object')

### 4. Exporting merged dataframe

In [34]:
df_merged.to_pickle(os.path.join(path, '02_Data','PreparedData', 'df_merged.pkl'))
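
A quick way to confirm an export is to read the file back and compare it with the dataframe in memory. A minimal sketch using a temporary directory and a toy dataframe (`toy_merged` and the file name are hypothetical, not the project's real path or data):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_merged (illustrative values only).
toy_merged = pd.DataFrame({
    'Year': [2001, 2001],
    'State': ['Alabama', 'Alaska'],
    'murder': [379.0, 39.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy_merged.pkl')
    toy_merged.to_pickle(pkl_path)
    reloaded = pd.read_pickle(pkl_path)

# equals() checks that shape, dtypes and values all match.
round_trip_ok = reloaded.equals(toy_merged)
print(round_trip_ok)  # True
```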