# Data Inspection

Data inspection is the process of testing your assumptions about a dataset. It includes validating data values, verifying data structure, and examining data distributions, among other dataset properties.

Anticipating and testing for the ways in which your data could be flawed should be among the first steps in your workflow after data collection. Scripting any corrections made to your dataset will be important for the reproducibility of your findings.

### Install necessary packages

In [None]:
! pip install --user geopy
! pip install --upgrade pandas

### Dataframes:
- Part of the pandas data analysis package (https://pandas.pydata.org/)
- Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
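
As a minimal sketch of these properties, the cell below builds a small dataframe from invented values (none of them come from the FEC file). The names `example_df`, `name`, `amount`, and `verified` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the FEC file.
example_df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],   # text column
        "amount": [250, 100, 75],            # integer column
        "verified": [True, False, True],     # boolean column
    },
    index=["a", "b", "c"],                   # row labels
)

print(example_df.dtypes)   # heterogeneous: one dtype per column
print(example_df.shape)    # (rows, columns)

# Size-mutable: adding a column changes the shape of the same object.
example_df["year"] = 2016
print(example_df.shape)
```

Rows and columns can both be addressed by label, for example `example_df.loc["b", "amount"]`.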

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

### Load FEC contribution data into a dataframe
- This dataset contains individual and organizational contributions to political action committees during 2015 and 2016.
- Goal: create a dataset of contributions from individuals in the US, to be used in a zip-code-level analysis of financial support and political affiliation.

In [None]:
donation_df = pd.read_csv('contributions2015_2016.txt', delimiter='|', nrows=5000000)

In [None]:
donation_df.head()

### Create dataframe with only columns of interest

In [None]:
d_df = donation_df[['ENTITY_TP', 'NAME', 'CITY', 'STATE', 'ZIP_CODE', 'EMPLOYER',
                    'TRANSACTION_AMT', 'Year', 'CMTE_PTY_AFFILIATION']]

In [None]:
d_df.head()

### Validate transaction amounts

In [None]:
# Count transactions with a negative amount (amounts are expected to be positive)
trans_validate = d_df['TRANSACTION_AMT'] < 0
trans_validate.sum()

In [None]:
neg_df = d_df[d_df['TRANSACTION_AMT'] < 0]
neg_df.head()
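
Counting negatives alone leaves zero amounts unexamined, and a later `> 0` filter drops zeros together with negatives. The sketch below, run on an invented series (`toy_amounts` is hypothetical, not FEC data), counts each case separately so the effect of the filter is visible beforehand.

```python
import pandas as pd

# Hypothetical amounts for illustration only.
toy_amounts = pd.Series([250, -100, 0, 75, -25], name="TRANSACTION_AMT")

# Count each sign separately; a "> 0" filter removes the first two groups.
n_negative = int((toy_amounts < 0).sum())
n_zero = int((toy_amounts == 0).sum())
n_positive = int((toy_amounts > 0).sum())

print(n_negative, n_zero, n_positive)

# Summary statistics help spot implausible minimum or maximum values.
print(toy_amounts.describe())
```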

### Limit dataset to individual contributions for 2016

In [None]:
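# Keep positive contributions made by individuals in 2016.
# .copy() makes ind_df an independent dataframe, so later column assignments are unambiguous.
ind_df = d_df[(d_df['ENTITY_TP'] == 'IND') & (d_df['TRANSACTION_AMT'] > 0)
              & (d_df['Year'] == 2016)].copy()
ind_df.head()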

### Examine data structure and data types

In [None]:
ind_df.shape

In [None]:
ind_df.info(verbose=True)
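
Beyond reading the `info()` output by eye, structural assumptions can be written as explicit checks. The following is a minimal sketch on an invented frame; `toy_df` and `expected_columns` are hypothetical names, and the values stand in for the real data.

```python
import pandas as pd

# Toy frame standing in for the real data; values are invented.
toy_df = pd.DataFrame(
    {
        "ZIP_CODE": ["12345", "987654321", None],
        "TRANSACTION_AMT": [250, 100, 75],
        "Year": [2016, 2016, 2016],
    }
)

# Check that every column the analysis depends on is present.
expected_columns = {"ZIP_CODE", "TRANSACTION_AMT", "Year"}
missing_columns = expected_columns - set(toy_df.columns)
assert not missing_columns, f"missing columns: {missing_columns}"

# Amounts should be numeric before any comparison or arithmetic.
assert pd.api.types.is_numeric_dtype(toy_df["TRANSACTION_AMT"])

# Count missing values per column.
null_counts = toy_df.isna().sum()
print(null_counts)
```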

### Validate zipcodes with regex and normalize to five digits

In [None]:
# Count zip codes that are exactly five digits.
# na=False treats missing values as invalid; .astype(bool) would turn NaN into True.
zip_validate = ind_df['ZIP_CODE'].str.match(r'^\d{5}$', na=False)
print(zip_validate.sum())
print(len(zip_validate))

In [None]:
# Zip codes that are not exactly five digits (missing values included)
ext_zip_df = ind_df[~ind_df['ZIP_CODE'].str.match(r'^\d{5}$', na=False)]
ext_zip_df.head()
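
Missing values deserve attention in regex validation. On an object-dtype column, `str.match` can return `NaN` for a missing entry, and `bool(NaN)` is `True`, so casting the result to `bool` may count missing zip codes as valid. The sketch below uses an invented series (`toy_zip` is hypothetical); the exact result of the cast can depend on the pandas version and the column dtype, so treat it as something to inspect rather than a guaranteed outcome.

```python
import pandas as pd

# Invented values: valid, nine-digit, too short, and missing.
toy_zip = pd.Series(["12345", "123456789", "1234", None], dtype=object)

# Casting the raw result to bool: inspect how the missing entry is treated.
naive = toy_zip.str.match(r"^\d{5}$").astype(bool)
print(naive.tolist())

# na=False explicitly maps missing entries to False.
strict = toy_zip.str.match(r"^\d{5}$", na=False)
print(strict.tolist())
```

Passing `na=False` makes the treatment of missing values explicit instead of leaving it to a type cast.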

### Create pandas series with first five characters of zip code field

In [None]:
# Convert every value to a string and keep the first five characters
five_zip = ind_df['ZIP_CODE'].astype(str).str[:5]
five_zip[:9]

In [None]:
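# Overwrite the zip code column with the five-character version.
# (A separate drop() is unnecessary: the assignment replaces the column.)
ind_df['ZIP_CODE'] = five_zip
print(ind_df.shape)
ind_df.head()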

In [None]:
# Values that are still not five digits after truncation
ext_zip_df = ind_df[~ind_df['ZIP_CODE'].str.match(r'^\d{5}$', na=False)]
ext_zip_df.head()

In [None]:
# Reduce dataframe to rows whose zip code is five digits.
# Note: this checks the format only, not whether the zip code actually exists.
ind_df = ind_df[ind_df['ZIP_CODE'].str.match(r'^\d{5}$', na=False)]
ind_df.shape
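
One related pitfall is that zip codes parsed as numbers lose their leading zeros, so a code such as `02139` no longer has five digits. Whether the FEC file is affected depends on how pandas infers the column type, which is not shown here. The sketch below reads an invented pipe-delimited sample (the `sample` text and its values are hypothetical) twice, once with default type inference and once with `dtype` forcing the column to text.

```python
import io
import pandas as pd

# Invented pipe-delimited sample in the same style as the input file.
sample = io.StringIO("NAME|ZIP_CODE\nA|02139\nB|100011234\nC|\n")

# Default inference parses the all-numeric column as numbers.
as_default = pd.read_csv(sample, delimiter="|")
print(as_default["ZIP_CODE"].tolist())

# Reading the column as text keeps the leading zero.
sample.seek(0)
as_text = pd.read_csv(sample, delimiter="|", dtype={"ZIP_CODE": str})
print(as_text["ZIP_CODE"].tolist())

# Truncating text values to five characters; the missing value stays missing.
five_digit = as_text["ZIP_CODE"].str[:5]
print(five_digit.tolist())
```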

### Save clean data to file

In [None]:
ind_df.to_csv("fec_clean.csv", index=False)

## Reshaping data

Principles of tidy data organization (https://vita.had.co.nz/papers/tidy-data.pdf):

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
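
To make these principles concrete, here is a small sketch with an invented table (the names `wide`, `tidy`, `pop_2000`, and `pop_2010` are hypothetical). In the wide form the year is hidden in the column names; after `melt`, each row holds one observation for one country and one year column.

```python
import pandas as pd

# Invented wide table: one population column per year.
wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "pop_2000": [10, 20],
        "pop_2010": [12, 25],
    }
)

# melt turns the year columns into rows: one row per (country, column) pair.
tidy = wide.melt(id_vars=["country"], var_name="variable", value_name="population")
print(tidy)
```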

Python Tidyverse implementation: dplython (https://itsalocke.com/blog/python-and-tidyverse/)

#### Restructuring dataset: Gapminder (https://www.gapminder.org/) GDP per country 1952-2007

In [None]:
data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
print(gapminder.tail(3))

### Select the GDP columns, plus continent and country, using regex

In [None]:
gdp_df = gapminder.loc[:, gapminder.columns.str.contains('^gdp|^c')]
print(gdp_df.head(n=3))

### Tidy the dataset

In [None]:
tidy_df = gdp_df.melt(id_vars=["continent", "country"], 
                              var_name="year", 
                              value_name="gdp")
tidy_df.head(n=10)

### Normalize year value to digits

In [None]:
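# Drop the first 10 characters (the column-name prefix) to leave the year, then convert to int
years = tidy_df['year'].apply(str).str[10:].apply(int)
# Overwrite the year column; a separate drop() is unnecessary
tidy_df['year'] = years
print(tidy_df.shape)
tidy_df.head()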
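
Slicing at a fixed position works only while every label has a prefix of the same length. As an alternative sketch, a regex can pull the trailing four digits regardless of prefix length. The labels below (`gdpPercap_1952` and so on) are assumed for illustration and `toy` is a hypothetical frame, not the Gapminder data itself.

```python
import pandas as pd

# Assumed label format for illustration: a text prefix followed by a four-digit year.
toy = pd.DataFrame(
    {"year": ["gdpPercap_1952", "gdpPercap_2007"], "gdp": [1.0, 2.0]}
)

# Capture the four digits at the end of each label and convert them to integers.
toy["year"] = toy["year"].str.extract(r"(\d{4})$", expand=False).astype(int)
print(toy)
```

If a label does not end in four digits, `str.extract` returns a missing value and the integer conversion fails, which surfaces the unexpected label instead of silently producing a wrong year.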