# Notebook Title

[Feature Engineering](#Feature Engineering)

[Visualizing Data](#Visualizing Data)

[Splitting Data](#Splitting Data)

[Modeling Data](#Modeling Data)

[Model Validation](#Model Validation)

## Feature Engineering

<a id='Feature Engineering'></a>

In [1]:
import numpy as np  # Math library
import pandas as pd  # Table library
import matplotlib.pyplot as plt  # Plotting library
import warnings
import seaborn as sns  # Plotting library
warnings.filterwarnings('ignore')  # Suppresses warning messages
%matplotlib nbagg

In [19]:
data= pd.read_csv('Consumer_Complaints_Search_Overdraft.csv')
data.head(3)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,11/24/2014,Prepaid card,Other special purpose card,"Overdraft, savings or rewards features",,,,Citibank,FL,33612,,,Phone,12/01/2014,Closed with monetary relief,Yes,Yes,1129509
1,11/25/2014,Prepaid card,General purpose card,"Overdraft, savings or rewards features",,,,"NetSpend Corporation, a TSYS Company",CA,90018,Older American,,Web,12/03/2014,Closed with explanation,Yes,No,1131773
2,03/24/2015,Payday loan,Payday loan,Can't stop charges to bank account,Can't stop charges to bank account,"I took out a Loan from Cash Central XXXX, Al f...",Company chooses not to provide a public response,"Community Choice Financial, Inc.",AL,351XX,Older American,Consent provided,Web,03/24/2015,Closed with explanation,Yes,No,1299258


In [21]:
data.shape

(2190, 18)

Since 'Consumer complaint narrative' is really long and can be expected to sway banks' responses, let's add a feature that counts the number of words in it. Let's also group states by [region](http://www.census.gov/econ/census/help/geography/regions_and_divisions.html "link") of the country to limit the number of features we one-hot encode.

In [85]:
data['consumer_narr_word_len'] = data['Consumer complaint narrative'].map(lambda x: len((str(x)).split()))
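One thing to watch with the word count above: `str(x)` turns a missing narrative (NaN) into the string `'nan'`, so complaints without a narrative get a count of 1 rather than 0. The snippet below is a minimal sketch on a made-up three-row Series (the names `narratives`, `naive`, and `safe` are illustrative, not part of the notebook) that compares the two behaviours, assuming a count of 0 is what we want for missing text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['Consumer complaint narrative']; values are made up
narratives = pd.Series(["I took out a loan", np.nan, "Overdraft fee charged twice"])

# Same approach as the cell above: NaN becomes the string 'nan', i.e. one word
naive = narratives.map(lambda x: len(str(x).split()))

# Alternative: treat missing narratives as empty text before splitting
safe = narratives.fillna("").str.split().str.len()

print(naive.tolist())
print(safe.tolist())
```

Whether 0 or 1 is the better value for missing narratives depends on how the model should treat them; the point is only that the choice is being made implicitly.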

In [98]:
#Define regions
northeast = ['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY','PA'] 
midwest = ['IL', 'IN', 'MI', 'OH', 'WI','IA', 'KS', 'MN', 'MO', 
           'NE', 'ND', 'SD']
south = ['DE', 'DC', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'WV', 'AL', 
         'KY', 'MS', 'TN', 'AR', 'LA', 'OK','TX']
west = ['AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT', 'WY', 'AK', 'CA', 
        'HI', 'OR', 'WA']

#Initialize state and region lists
state= data['State'].values
region= []

#Assign regions to state
for i in state:
    if i in northeast: region.append('northeast')
    elif i in midwest: region.append('midwest')
    elif i in south: region.append('south')
    elif i in west: region.append('west')
    else: region.append(None)

#Make region new column in dataframe
data['region']= region

In [99]:
data['region'].unique()

array(['south', 'west', 'northeast', 'midwest', None], dtype=object)
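The same assignment can also be written as a dictionary lookup instead of a loop. This is a sketch on a toy Series rather than the real data; `regions`, `state_to_region`, and `states` are illustrative names. One difference from the loop above: `Series.map` leaves unmatched entries as NaN rather than `None`.

```python
import pandas as pd

# Same four census regions as in the cell above
regions = {
    'northeast': ['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'],
    'midwest': ['IL', 'IN', 'MI', 'OH', 'WI', 'IA', 'KS', 'MN', 'MO',
                'NE', 'ND', 'SD'],
    'south': ['DE', 'DC', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'WV', 'AL',
              'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'],
    'west': ['AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT', 'WY', 'AK', 'CA',
             'HI', 'OR', 'WA'],
}

# Invert to a state -> region lookup table
state_to_region = {s: r for r, members in regions.items() for s in members}

# Toy stand-in for data['State']; 'PR' and None fall outside the four lists
states = pd.Series(['FL', 'CA', 'NY', 'OH', 'PR', None])
region = states.map(state_to_region)
print(region.tolist())
```

On the real frame this would be `data['region'] = data['State'].map(state_to_region)`.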

Now that we have these, let's see what columns we'll keep in our analysis, which columns we'll turn into dummy variables and which columns will just be dropped out altogether.

In [100]:
names= list(data.columns)

In [102]:
#Counts unique entries for each column
for name in names:
    print('{} :  {}'.format(name, len(data[name].unique())))

Date received :  549
Product :  11
Sub-product :  36
Issue :  65
Sub-issue :  36
Consumer complaint narrative :  2158
Company public response :  10
Company :  172
State :  54
ZIP code :  550
Tags :  4
Consumer consent provided? :  3
Submitted via :  3
Date sent to company :  541
Company response to consumer :  5
Timely response? :  2
Consumer disputed? :  3
Complaint ID :  2190
consumer_narr_word_len :  642
region :  5
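For reference, pandas can produce the same per-column counts in one call. The sketch below uses a made-up two-column frame (`toy` is illustrative) and assumes a pandas version that provides `DataFrame.nunique`. Note that `len(series.unique())` counts NaN as a value, so `dropna=False` is needed to match the numbers above.

```python
import numpy as np
import pandas as pd

# Toy frame with a missing value in 'Tags'; values are made up
toy = pd.DataFrame({
    'Product': ['Prepaid card', 'Prepaid card', 'Payday loan'],
    'Tags': ['Older American', np.nan, np.nan],
})

# Counts NaN as its own value, like len(series.unique())
with_nan = toy.nunique(dropna=False)

# Default behaviour ignores NaN
without_nan = toy.nunique()

print(with_nan)
print(without_nan)
```

This also explains why columns such as 'Timely response?' can show one more unique value than expected when they contain missing entries.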


We determined that 'Consumer complaint narrative' was too long and summarized it with a word count, so we can eliminate it now. We'll also delete the dates because they can be summarized by 'Timely response?', and 'State' and 'ZIP code' because we now have region information. 'Sub-issue' only appears for a fraction of our data, so we'll remove it too. For the remaining columns, we'll keep them as-is if they're continuous, or one-hot encode their values as additional columns (expect ~200).

Columns to eliminate: Date received, Sub-issue, Consumer complaint narrative, State,
    ZIP code, Date sent to company, Complaint ID
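The drop-then-encode plan could look roughly like the sketch below. It runs on a tiny made-up frame with only a few of the columns, and `toy`, `to_drop`, `categorical`, and `encoded` are illustrative names; it assumes a pandas version where `DataFrame.drop` accepts `columns=`. It is not necessarily how the rest of the notebook does it.

```python
import pandas as pd

# Toy frame with a subset of the real columns; values are made up
toy = pd.DataFrame({
    'Date received': ['11/24/2014', '11/25/2014', '03/24/2015'],
    'State': ['FL', 'CA', 'AL'],
    'Submitted via': ['Phone', 'Web', 'Web'],
    'Timely response?': ['Yes', 'Yes', 'Yes'],
    'consumer_narr_word_len': [1, 1, 12],
    'region': ['south', 'west', 'south'],
})

# Columns we decided to eliminate (only those present in the toy frame)
to_drop = ['Date received', 'State']

# Remaining non-continuous columns get one indicator column per value
categorical = ['Submitted via', 'Timely response?', 'region']

encoded = pd.get_dummies(toy.drop(columns=to_drop), columns=categorical)
print(encoded.columns.tolist())
```

The continuous column passes through unchanged, while each categorical column is replaced by indicator columns named `<column>_<value>`.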

In [66]:
#data.head(20)

## Formatting Data

<a id='Formatting Data'></a>

In [47]:
cleanData= data.copy()

In [48]:
cleanData.dtypes

Date received                   object
Product                         object
Sub-product                     object
Issue                           object
Sub-issue                       object
Consumer complaint narrative    object
Company public response         object
Company                         object
State                           object
ZIP code                        object
Tags                            object
Consumer consent provided?      object
Submitted via                   object
Date sent to company            object
Company response to consumer    object
Timely response?                object
Consumer disputed?              object
Complaint ID                     int64
dtype: object

In [70]:
text= 'diego'

In [71]:
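# Compares the string value to the type object itself, so this is always False;
# a type check would be isinstance(text, str)
text == str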

False