# Python Basics
This notebook will demonstrate how to accomplish each of the tasks that Bob sent over as a starting point for working in the new Thrivent environment. They cover the following areas:
* Reading/Writing files - xlsx, csv
* Subsetting and merging data frames
* Adding calculated columns
* Basic statistics
* Plotting
* Modeling

### Load Packages

In [1]:
import pandas as pd # for data frames, reading and writing data
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
import psycopg2 # for connecting to a postgres database
import numpy as np # using this to create a range of floats

from scipy.stats import chi2_contingency, kstest


# the next line is so that the matplotlib plots show up in the notebook cell
%matplotlib inline

## Load Data
We'll use some data from the Twitter work we've been doing for most of the analysis. 

### 0. Read Data from Excel
I'm using pandas to read in the data file from Excel. If the file is located in the same directory as the notebook, this will work as written. Otherwise, prepend the path to the filename. Pandas will automatically infer data types, columns and row numbers from the data. There are quite a few arguments that you can pass to this function to control what is loaded and how. The following cell brings up the docstring for this function, which explains all of the options.

In [2]:
?pd.read_excel

In [3]:
filename = 'sample_data.xlsx'
t_data = pd.read_excel(filename, sheet_name='tweets_classified')
t_data.head()

Unnamed: 0,text,class,topic
0,You remind me of my BimmerSee your ignition b...,0,Birth
1,RT @JaDineNATION: Were so excited for our dese...,0,Birth
2,i always get super self conscious about keepin...,0,Birth
3,@Juhhhhhhnelle Weird ass bitch lucky Im pregna...,1,Birth
4,RT @thdmichaelbell: New Signage above our Fron...,0,Birth


### 1. Export dataframe to tab delimited file
Now that we have some data to work with, we can export it to a tab-delimited file. After exporting, we'll remove the data frame and reload it from the exported file (which keeps the `.csv` extension here, even though it is tab-separated).
* setting the `sep` argument to '\t' makes it tab separated. The default is comma separated.
* setting `index=False` prevents it from writing out the row numbers as a column, which would create an extraneous column when the file is read back in.

In [4]:
export_filename = 'sample_data.csv'
t_data.to_csv(export_filename, sep='\t', index=False)
t_data = None

In [5]:
t_data = pd.read_csv(export_filename, sep='\t')
t_data.head()

Unnamed: 0,text,class,topic
0,You remind me of my BimmerSee your ignition b...,0,Birth
1,RT @JaDineNATION: Were so excited for our dese...,0,Birth
2,i always get super self conscious about keepin...,0,Birth
3,@Juhhhhhhnelle Weird ass bitch lucky Im pregna...,1,Birth
4,RT @thdmichaelbell: New Signage above our Fron...,0,Birth
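
The export/reload round trip can also be tried without touching the disk. This is a minimal sketch, assuming a small made-up frame (`demo` is a hypothetical stand-in for `t_data`) and an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for t_data; the values are made up for illustration.
demo = pd.DataFrame({
    'text': ['first tweet', 'second, with a comma', 'third tweet'],
    'class': [0, 1, 0],
    'topic': ['Birth', 'Moving', 'Birth'],
})

# Write to an in-memory buffer instead of a file so nothing is left on disk.
buffer = io.StringIO()
demo.to_csv(buffer, sep='\t', index=False)

# Rewind the buffer and read it back with the same separator.
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep='\t')

# Compare the reloaded frame with the original.
print(reloaded.equals(demo))
print(reloaded.columns.tolist())
```

Because `index=False` was used on the way out, the reloaded frame has only the three original columns. With a real file, an extension such as `.tsv` or `.txt` may be a clearer signal that the contents are tab-separated.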


Let's get an idea of what's in this dataframe - I know it has texts from different topics. Let's see how many from each are in there:

In [6]:
t_data['topic'].value_counts()

Marriage      241
Moving        240
Divorce       226
Graduation    226
Birth         199
Name: topic, dtype: int64

In [7]:
t_data['class'].value_counts()

0    883
1    206
2     43
Name: class, dtype: int64

Since we want the `class` variable to be binary, we have some data clean-up to do here. At some point I started using 2 for negatives, since it was easier on the keyboard than 0! Let's replace all of those 2s with 0 to make class truly binary.

In [8]:
t_data.loc[t_data['class']==2,'class'] = 0 
t_data['class'].value_counts()

0    926
1    206
Name: class, dtype: int64
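
For reference, the same recode can be written a couple of ways. A minimal sketch on made-up labels (the `labels` frame is hypothetical, not the notebook's data), comparing the `.loc` assignment used above with `Series.replace`:

```python
import pandas as pd

# Made-up labels standing in for t_data['class'].
labels = pd.DataFrame({'class': [0, 1, 2, 0, 2, 1]})

# Option 1 (as in the cell above): boolean mask plus .loc assignment.
recoded_loc = labels.copy()
recoded_loc.loc[recoded_loc['class'] == 2, 'class'] = 0

# Option 2: Series.replace maps old values to new ones.
recoded_replace = labels.copy()
recoded_replace['class'] = recoded_replace['class'].replace({2: 0})

print(recoded_loc['class'].tolist())
print(recoded_replace['class'].tolist())
```

Both versions work on a copy here so the original `labels` frame stays available for comparison.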

In [9]:
# Check data types
t_data.dtypes

text     object
class     int64
topic    object
dtype: object

In [10]:
# Get some descriptive data from this dataframe
t_data.describe(include='all')

Unnamed: 0,text,class,topic
count,1132,1132.0,1132
unique,1122,,5
top,@ryrob51 @veyseyor @veyseyor Ryan is moving in,,Marriage
freq,2,,241
mean,,0.181979,
std,,0.385998,
min,,0.0,
25%,,0.0,
50%,,0.0,
75%,,0.0,


### Subsetting
Subsetting dataframes with Pandas is very similar to subsetting in R. Since the sample data has data from 5 different topics, let's pull out two topics and make them separate data frames.

Unlike R, subsetting in pandas is usually done through the `loc` or `iloc` indexers, which go before the subset parameters.
* `loc` is used when you have a criterion based on the values in one or more columns (or on index labels)
* `iloc` selects by numeric position in the dataframe. For example, if you wanted the first 10 rows of the data frame, you'd do the following:

*NOTE: Unlike R, Python is zero-based, so lists and indexes start at zero, rather than one.*

#### Subsetting with `iloc`

In [11]:
# First 10 rows
t_data.iloc[0:10]

Unnamed: 0,text,class,topic
0,You remind me of my BimmerSee your ignition b...,0,Birth
1,RT @JaDineNATION: Were so excited for our dese...,0,Birth
2,i always get super self conscious about keepin...,0,Birth
3,@Juhhhhhhnelle Weird ass bitch lucky Im pregna...,1,Birth
4,RT @thdmichaelbell: New Signage above our Fron...,0,Birth
5,NEW SUPER BABY 2 SCAN AND TRANSLATION AND NEW ...,0,Birth
6,RT @Prof_Hariom: @OmarAbdullah Throw out of Bh...,0,Birth
7,RT @dre85567034: Mmm...Busty Preggos Yummy! Pa...,0,Birth
8,baby girl you re a star,0,Birth
9,RT @DDuaneOfficial: Glad I was able to speak w...,0,Birth
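
A few more positional patterns, as a minimal sketch on a made-up frame (`demo` is hypothetical). One thing worth knowing coming from R: `iloc` slices exclude the end position, while `loc` slices on labels include it.

```python
import pandas as pd

# Made-up frame with the default 0..5 integer index.
demo = pd.DataFrame({'text': list('abcdef'), 'class': [0, 1, 0, 0, 1, 0]})

first_three = demo.iloc[0:3]   # positions 0, 1, 2 (end position is exclusive)
last_two = demo.iloc[-2:]      # negative positions count from the end
cell = demo.iloc[1, 0]         # row position 1, column position 0

# Label-based slicing with loc includes the end label.
by_label = demo.loc[0:3]

print(len(first_three), len(by_label))
print(last_two['text'].tolist(), cell)
```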


#### Subsetting with `loc`
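When referencing columns in pandas, you can use either `dataframe.column_name` or `dataframe['column_name']`. They usually work the same way, but the `.column_name` form fails when the column name is not a valid Python identifier (for example, it contains a space), is a reserved word, or collides with an existing DataFrame attribute or method. `['column_name']` always works, so it is the more reliable choice. In this dataframe, I had this issue with the 'class' column, since `class` is a reserved word in Python.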
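
To illustrate the column-access point, here is a minimal sketch on a made-up frame (`demo` and its `count` column are hypothetical, chosen because `count` collides with a DataFrame method):

```python
import pandas as pd

# Hypothetical frame: 'count' collides with the DataFrame.count method,
# and 'class' is a reserved word in Python.
demo = pd.DataFrame({'count': [3, 5], 'class': [0, 1], 'topic': ['Birth', 'Moving']})

print(type(demo.topic))        # attribute access works for a plain name
print(callable(demo.count))    # this is the method, not the column
print(demo['count'].tolist())  # bracket access returns the column
print(demo['class'].tolist())  # and also works for reserved words
# demo.class would not even parse: Python raises a SyntaxError.
```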

In [12]:
moving_df = t_data.loc[t_data.topic=='Moving']
moving_df.head()

Unnamed: 0,text,class,topic
892,Peaches is moving to the City where lots of ex...,0,Moving
893,The best thing we can realistically hope for i...,0,Moving
894,@Iam100Savage A Fund was moving out and hence ...,0,Moving
895,"You could win $4,000 towards a new home theate...",0,Moving
896,"If I wasn’t in a relationship, after graduatio...",0,Moving


In [13]:
marriage_df = t_data.loc[t_data['topic']=='Marriage']
marriage_df.head()

Unnamed: 0,text,class,topic
651,2 years ago i casually pinned wedding ideas da...,0,Marriage
652,so completely honored to stand next to this ma...,0,Marriage
653,the wedding chapel http://t.co/xthvajo4,0,Marriage
654,i love that my modern-day beauty &amp; the bea...,0,Marriage
655,on the train heading to pa for a friends weddi...,0,Marriage


#### Multiple subset criteria
This works much the same way that subsetting in R does. Let's find all of the Marriage and Moving tweets where `class`==1. A common error when subsetting is: `The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().` If you get that, make sure you are using `&` and `|` for the and/or operators rather than `and`/`or`. If it's still an issue, check the parentheses: `&` and `|` bind more tightly than comparisons such as `==`, so each comparison needs its own pair of parentheses.

In [14]:
subset2 = t_data.loc[((t_data['topic']=='Marriage') & (t_data['class']==1)) |
                     ((t_data['topic']=='Moving') & (t_data['class']==1))]
subset2.head()

Unnamed: 0,text,class,topic
656,houston...i love you and i hate you right now....,1,Marriage
663,chaelisa for we got married,1,Marriage
664,here is how i feel about wedding thank you car...,1,Marriage
670,for all interested here is the video from our...,1,Marriage
682,thankful for an aunt who can do a few last min...,1,Marriage
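
The same kind of filter can often be shortened with `isin`. A minimal sketch on a made-up frame (`demo` is hypothetical), checking that the long and short forms select the same rows:

```python
import pandas as pd

# Made-up stand-in for t_data.
demo = pd.DataFrame({
    'topic': ['Marriage', 'Moving', 'Birth', 'Marriage', 'Moving'],
    'class': [1, 0, 1, 0, 1],
})

# Each comparison is wrapped in parentheses because & and | bind more
# tightly than ==.
explicit = demo.loc[((demo['topic'] == 'Marriage') & (demo['class'] == 1)) |
                    ((demo['topic'] == 'Moving') & (demo['class'] == 1))]

# Shorter form: isin() covers the "either topic" part in one test.
compact = demo.loc[demo['topic'].isin(['Marriage', 'Moving']) & (demo['class'] == 1)]

print(explicit.equals(compact))
print(compact.index.tolist())
```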


### 2. Merging Data Frames
Now that we have two separate data frames for Marriage and Moving, let's merge them together and see if the number of class==1 matches our subset above. There are a lot of options when merging data frames - similar to joins with data tables. The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) is pretty helpful as is StackOverflow, of course.

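This first example is just a combination of two dataframes, with no key specified. When no `on` argument is given, `merge` joins on all of the columns the two frames have in common; since no row appears in both frames here, the outer join simply keeps every row from each.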

In [15]:
merged_df = marriage_df.merge(moving_df, how='outer')
print('Marriage data frame: {}'.format(len(marriage_df)))
print('Moving data frame: {}'.format(len(moving_df)))
print('Merged data frame: {}'.format(len(merged_df)))

Marriage data frame: 241
Moving data frame: 240
Merged data frame: 481


In [16]:
# Does our count of positive tweets match between the subset and the merged data?
len(subset2) == len(merged_df.loc[merged_df['class']==1])

True
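
Since the two frames share no rows, the outer merge above behaves like stacking one on top of the other. A minimal sketch of the same idea with `pd.concat`, using small made-up frames (`marriage_demo` and `moving_demo` are hypothetical stand-ins, not the notebook's data):

```python
import pandas as pd

# Made-up stand-ins for marriage_df and moving_df.
marriage_demo = pd.DataFrame({'text': ['a', 'b'], 'class': [0, 1],
                              'topic': ['Marriage'] * 2})
moving_demo = pd.DataFrame({'text': ['c', 'd', 'e'], 'class': [1, 0, 0],
                            'topic': ['Moving'] * 3})

# concat stacks the rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([marriage_demo, moving_demo], ignore_index=True)

# The outer merge on all shared columns gives the same number of rows here.
merged = marriage_demo.merge(moving_demo, how='outer')

print(len(stacked), len(merged))
```

If an identical row did appear in both frames, the outer merge would match the pair into a single row while `concat` would keep both, so the two approaches are not interchangeable in general.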

We need some different data to show how to merge on different keys. Our sample file has data for this too in the `tweet_data` and `user_data` sheets.

In [17]:
filename

'sample_data.xlsx'

In [18]:
tweet_df = pd.read_excel(filename, sheet_name='tweet_data')
tweet_df.head()

Unnamed: 0,tweet_id,id_str,created_at,text,user_id,favorite_count,favorited,in_reply_to_status_id,in_reply_to_user_id,lang,place,retweet_count,retweeted
0,956547739344449024,956547739344449024,2018-01-25 15:22:07,Cant be able to stream awhile ago I m rushing ...,923525633036001024,0,False,,,en,,0,False
1,956548439587938048,956548439587938048,2018-01-25 15:24:54,RT @afaidd: Im in 40K+ debt... you think Im ju...,936985398391208960,0,False,,,en,,0,False
2,956548440372203008,956548440372203008,2018-01-25 15:24:54,Delighted to have graduated from column to ful...,374464187,0,False,,,en,,0,False
3,956548447393389952,956548447393389952,2018-01-25 15:24:56,RT @tbhOffice: Scotts Tots: We graduated! Can ...,2866235550,0,False,,,en,,0,False
4,956548504327015936,956548504327015936,2018-01-25 15:25:09,RT @jellsoval: realizing i m gonna graduate wi...,2723199724,0,False,,,en,,0,False


In [19]:
user_df = pd.read_excel(filename, sheet_name='user_data')
user_df.head()

Unnamed: 0,id,id_str,name,screen_name,location,followers_count,friends_count,favourites_count,description,geo_enabled,...,statuses_count,time_zone,created_at,verified,utc_offset,contributors_enabled,listed_count,protected,url,state
0,16202079,16202079,Mary Wade Atteberry,maryatteberry,Indianapolis,138,113,719,Public relations and marketing professional w...,0,...,1070,Eastern Time (US & Canada),2008-09-09 13:36:34,0,-18000.0,0,12,0,,Indiana
1,16251524,16251524,Sara Sanchez-Zweig,shouldbesara,Brooklyn NY,105,505,610,Chronic renewer of library books. Theater Per...,1,...,642,Quito,2008-09-12 02:08:03,0,-18000.0,0,1,0,http://lettersfromthemezz.com,
2,23331412,23331412,Tyler Harrison,JT_Harrison,Wichita KS,608,1184,1227,Husband Father. @WSUSportMgmt grad. @Wingnuts...,1,...,5453,Central Time (US & Canada),2009-03-08 17:42:46,0,-21600.0,0,10,0,,Kansas
3,24460145,24460145,Spontaneous #1 07,trish_08,Myrtle Beach SC,418,1424,2762,Educator and mentor..#TealNation #DeltaSigmaT...,1,...,4466,,2009-03-15 00:47:03,0,,0,0,0,,South Carolina
4,25812544,25812544,Tanay Modi,tanaymodi1,New York NY,639,364,4,Sevenoaks Tufts Citigroup Technology Inve...,1,...,9745,Eastern Time (US & Canada),2009-03-22 12:34:56,0,-18000.0,0,28,0,,Massachusetts


Now that we have a handful of tweets, we want to merge the tweet data with the user data to append specific user columns to the tweet data. Let's only grab a few columns from each data frame to keep it easy to read. We can select a subset of columns with no other criteria with `dataframe[[list of columns]]`.

The `how` parameter of the merge works like a join type, defining which rows to keep when there isn't a match in both dataframes. It defaults to an inner join. In this case I want to keep all of the tweets, even if we don't have a user record, so I'm using `how='left'` since the first table in the merge (the left one) is the tweet_df.

### 2. (cont): Merge using a unique match key.

In [20]:
merged_tweets = tweet_df[['tweet_id', 'created_at', 'user_id', 'text']].merge(
    user_df[['id', 'name','screen_name', 'followers_count']],
    left_on='user_id',
    right_on='id',
    how='left')

merged_tweets.head()

Unnamed: 0,tweet_id,created_at,user_id,text,id,name,screen_name,followers_count
0,956547739344449024,2018-01-25 15:22:07,923525633036001024,Cant be able to stream awhile ago I m rushing ...,923525633036001024,jeni #RETURN,beimylove,40
1,956548439587938048,2018-01-25 15:24:54,936985398391208960,RT @afaidd: Im in 40K+ debt... you think Im ju...,936985398391208960,Golovkin,Golovkin0,14
2,956548440372203008,2018-01-25 15:24:54,374464187,Delighted to have graduated from column to ful...,374464187,Daniel Murray,murraymuzz,5038
3,956548447393389952,2018-01-25 15:24:56,2866235550,RT @tbhOffice: Scotts Tots: We graduated! Can ...,2866235550,em,garciaemily903,347
4,956548504327015936,2018-01-25 15:25:09,2723199724,RT @jellsoval: realizing i m gonna graduate wi...,2723199724,dav,damnit_dav,578
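
To see what `how='left'` does when a tweet has no user record, here is a minimal sketch on made-up frames (`tweets` and `users` are hypothetical). It also uses the `indicator=True` option of `merge`, which adds a `_merge` column showing where each row came from:

```python
import pandas as pd

# Made-up frames; user 99 has no matching user record.
tweets = pd.DataFrame({'tweet_id': [1, 2, 3], 'user_id': [10, 20, 99]})
users = pd.DataFrame({'id': [10, 20, 30], 'screen_name': ['ann', 'bob', 'cy']})

# Left join: keep every tweet, fill user columns with NaN when unmatched.
left = tweets.merge(users, left_on='user_id', right_on='id',
                    how='left', indicator=True)

# Inner join (the default): keep only tweets with a matching user.
inner = tweets.merge(users, left_on='user_id', right_on='id', how='inner')

print(left[['tweet_id', 'screen_name', '_merge']])
print(len(left), len(inner))
```

Counting the `left_only` rows in `_merge` is a quick way to check how many tweets were left without user data.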


### 3. Bin a continuous variable into a new variable.
Since we have a bunch of users, let's bin their followers_count into equal width bins. This was a new one for me, but there is a handy pandas function for it, similar to R, called [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html)

In [21]:
merged_tweets['binned_followers'] = pd.cut(
    merged_tweets['followers_count'],
    bins=5,
    labels=['very_low', 'low', 'medium', 'high', 'very_high'])
merged_tweets['binned_followers'].value_counts()

very_low     198
very_high      1
high           1
medium         0
low            0
Name: binned_followers, dtype: int64

### 4. Binning with an equal number of members
Cutting the followers_count into equal-width bins wasn't very helpful, since there are two users with so many followers that nearly everyone else lands in the lowest bin.

A more useful approach may be to use quantiles (quintiles here, since we are making 5 bins). For that, we'll calculate the quantile cutoffs ahead of time, then pass them into the `cut` function as the `sequence of scalars`.

In [22]:
bins = 5
# use the Series quantile method and np.linspace to generate the cutoff values for the cut function.
cutoffs = list(merged_tweets['followers_count'].quantile(np.linspace(0,1,bins+1)))

# create some labels for our new, binned column
labels = ['Q'+str(x) for x in range(1,bins+1)]

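# cut the data based on the cutoffs
# (intervals are open on the left, so the minimum value gets no bin unless include_lowest=True is passed)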
merged_tweets['quartile_followers'] = pd.cut(merged_tweets['followers_count'], cutoffs, labels = labels)

# check if it worked
merged_tweets['quartile_followers'].value_counts()

Q1    41
Q5    40
Q4    40
Q3    40
Q2    38
Name: quartile_followers, dtype: int64
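
pandas also has `pd.qcut`, which works out the quantile edges itself. A minimal sketch on made-up follower counts (the `followers` values are hypothetical), next to the manual `quantile` plus `cut` approach for comparison:

```python
import pandas as pd

# Made-up follower counts with one very large value.
followers = pd.Series([3, 40, 120, 250, 300, 480, 520, 800, 950, 40275])
labels = ['Q' + str(i) for i in range(1, 6)]

# qcut computes the quantile edges and assigns the bins in one step.
binned = pd.qcut(followers, q=5, labels=labels)

# Manual version: compute the edges, then cut on them.
edges = followers.quantile([0, 0.2, 0.4, 0.6, 0.8, 1]).tolist()
manual = pd.cut(followers, edges, labels=labels)

print(binned.value_counts().sort_index())
print('unbinned with qcut:', binned.isna().sum())
print('unbinned with cut: ', manual.isna().sum())
```

In this sketch the manual version leaves the smallest value without a bin, because `cut` intervals exclude their left edge by default. That is consistent with the counts above adding up to 199 rather than 200.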

Let's see how that worked out. We'll look at the mean, median and standard deviation of followers_count for each quantile bin. We can use the groupby function in pandas to get these aggregates.

In [23]:
merged_tweets.groupby('quartile_followers')['followers_count'].agg(['mean', 'median', 'std'])

Unnamed: 0_level_0,mean,median,std
quartile_followers,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Q1,113.756098,126.0,65.506404
Q2,285.315789,291.5,49.215525
Q3,495.75,489.5,75.643479
Q4,810.775,801.0,135.940501
Q5,4024.45,2040.5,7469.608241


### 5. Create a two way frequency table
Back to our classified tweet data, we have different topics (multiple) and we have classification (binary). We can create a two-way frequency table showing the number of each class in each topic. We can use `pandas.crosstab` to get to this result.

In [24]:
two_way = pd.crosstab(t_data['topic'],t_data['class'])
two_way

class,0,1
topic,Unnamed: 1_level_1,Unnamed: 2_level_1
Birth,172,27
Divorce,213,13
Graduation,149,77
Marriage,196,45
Moving,196,44
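
`crosstab` can also add totals or convert the counts to proportions. A minimal sketch on a made-up frame (`demo` is hypothetical), using the `margins` and `normalize` arguments:

```python
import pandas as pd

# Made-up stand-in for t_data.
demo = pd.DataFrame({
    'topic': ['Birth', 'Birth', 'Birth', 'Moving', 'Moving', 'Moving', 'Moving'],
    'class': [0, 0, 1, 0, 1, 1, 1],
})

# margins=True adds an 'All' row and column with the totals.
counts = pd.crosstab(demo['topic'], demo['class'], margins=True)

# normalize='index' turns each row into proportions that sum to 1.
shares = pd.crosstab(demo['topic'], demo['class'], normalize='index')

print(counts)
print(shares)
```

Row proportions make it easier to compare topics that have different numbers of tweets.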


You can also use the pivot_table function to get to the same results.

In [25]:
two_way_pivot = t_data.pivot_table(index='topic', columns = 'class', aggfunc=len)

These results are indexed on both axes: topic labels the rows and class labels the columns, which makes subsetting a little more involved. With `loc` you pass a row selector and a column selector, in that order.

In [26]:
two_way.loc[(['Marriage','Moving'],[0,1])]

class,0,1
topic,Unnamed: 1_level_1,Unnamed: 2_level_1
Marriage,196,45
Moving,196,44


In [27]:
two_way_pivot.loc[['Marriage','Moving']]

Unnamed: 0_level_0,text,text
class,0,1
topic,Unnamed: 1_level_2,Unnamed: 2_level_2
Marriage,196,45
Moving,196,44


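The pivot approach is more difficult to subset: because no `values` column was given, its columns have two levels (`text`, then `class`), so a column selector has to name both levels, e.g. `('text', 1)`.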
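
Here is a minimal sketch of subsetting a pivot table like this one, assuming a small made-up frame (`demo` is hypothetical) with the same three columns as `t_data`:

```python
import pandas as pd

# Made-up stand-in for t_data.
demo = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
    'class': [0, 1, 0, 1, 1, 0],
    'topic': ['Marriage', 'Marriage', 'Moving', 'Moving', 'Birth', 'Birth'],
})

pivot = demo.pivot_table(index='topic', columns='class', aggfunc=len)
print(pivot.columns.tolist())  # column labels are (value column, class) pairs

# Option 1: name both column levels in the selector.
positives = pivot.loc[['Marriage', 'Moving'], ('text', 1)]

# Option 2: select the outer level first, then subset as with the crosstab.
flat = pivot['text']
both = flat.loc[['Marriage', 'Moving'], [0, 1]]

print(positives)
print(both)
```

Selecting the outer `text` level first leaves a frame shaped like the crosstab result, so the same row and column selectors work on it.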

### Chi-Squared independence test, statistic and p-value
[Scipy](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html) has a good function for this. We can subset the crosstab to only grab the 0 and 1 columns and use that as our contingency table. The results from this function are, in order:
* chi2 : float, The test statistic.
* p : float, The p-value of the test
* dof : int, Degrees of freedom
* expected : ndarray, same shape as observed, The expected frequencies, based on the marginal sums of the table.

In [28]:
chi2 = chi2_contingency(two_way[[0,1]])

In [29]:
chi2

(64.6713594960832,
 3.0178804528059226e-13,
 4,
 array([[162.78621908,  36.21378092],
        [184.87279152,  41.12720848],
        [184.87279152,  41.12720848],
        [197.14310954,  43.85689046],
        [196.32508834,  43.67491166]]))
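
The result is easier to work with if it is unpacked into named variables. A minimal sketch, with the observed counts copied from the crosstab above; the 0.05 significance level is an assumption chosen for illustration:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the crosstab output above (class 0 and class 1 per topic).
observed = pd.DataFrame(
    {0: [172, 213, 149, 196, 196], 1: [27, 13, 77, 45, 44]},
    index=['Birth', 'Divorce', 'Graduation', 'Marriage', 'Moving'],
)

# Unpack the four results in the order listed above.
stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # conventional significance level, assumed here for illustration
print('chi2 = {:.2f}, dof = {}, p = {:.3g}'.format(stat, dof, p_value))
print('reject independence' if p_value < alpha else 'fail to reject independence')

# Put the expected counts back into a labelled frame for easier reading.
expected_df = pd.DataFrame(expected, index=observed.index, columns=observed.columns)
print(expected_df.round(1))
```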

### 6. How to generate a summary of interval/continuous/numeric variables including
* Basic statistics like the mean, median, percentiles, standard deviation, etc.
* Confidence intervals around the mean

A quick way to get to some of this information is with the `describe` function on a dataframe. By default, this will only describe numeric variables:

In [30]:
merged_tweets.describe()

Unnamed: 0,tweet_id,user_id,id,followers_count
count,200.0,200.0,200.0,200.0
mean,9.56576e+17,1.261988e+17,1.261988e+17,1143.74
std,109191200000000.0,3.026004e+17,3.026004e+17,3616.80929
min,9.565477e+17,16202080.0,16202080.0,3.0
25%,9.565582e+17,391545200.0,391545200.0,231.75
50%,9.565735e+17,1599265000.0,1599265000.0,489.5
75%,9.565759e+17,3177317000.0,3177317000.0,911.25
max,9.581059e+17,9.560236e+17,9.560236e+17,40275.0


If we add `include='all'` we'll get descriptive data on the rest of the columns, with a bunch of 'NaN' for irrelevant statistics.

In [31]:
merged_tweets.describe(include='all')

Unnamed: 0,tweet_id,created_at,user_id,text,id,name,screen_name,followers_count,binned_followers,quartile_followers
count,200.0,200,200.0,200,200.0,196,200,200.0,200,199
unique,,191,,100,,194,200,,3,5
top,,2018-01-25 15:24:54,,RT @tbhOffice: Scotts Tots: We graduated! Can ...,,.,nowayjose101,,very_low,Q1
freq,,2,,56,,2,1,,198,41
first,,2018-01-25 15:22:07,,,,,,,,
last,,2018-01-29 22:33:53,,,,,,,,
mean,9.56576e+17,,1.261988e+17,,1.261988e+17,,,1143.74,,
std,109191200000000.0,,3.026004e+17,,3.026004e+17,,,3616.80929,,
min,9.565477e+17,,16202080.0,,16202080.0,,,3.0,,
25%,9.565582e+17,,391545200.0,,391545200.0,,,231.75,,


Since this gives you the standard deviation for numeric fields, you can use that to create confidence intervals as needed. Here's how you can pull values out of this function - let's say we want the standard deviation for the followers_count:

In [32]:
# set the description to a data frame variable, then pull the value as a subset
desc = merged_tweets.describe(include='all')
desc.loc['std','followers_count']

3616.809289912587

In [33]:
# or call the function and pull the value directly from the results, if there's no other need for that data.
merged_tweets.describe().loc['std','followers_count']

3616.809289912587

In [34]:
# Double checking that it's right using the standard deviation function (std)
merged_tweets['followers_count'].std()

3616.809289912587
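
To get from the standard deviation to a confidence interval around the mean, one option is a t-based interval. This is a minimal sketch, not the only way to do it: the summary values are copied from the `describe` output above, and the 95% level is an assumption chosen for illustration.

```python
import math
from scipy import stats

# Summary values copied from the describe() output above.
n = 200
mean = 1143.74
std = 3616.809289912587  # sample standard deviation, as reported by pandas

# Standard error of the mean.
standard_error = std / math.sqrt(n)

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - 1)

lower = mean - t_crit * standard_error
upper = mean + t_crit * standard_error
print('95% CI for the mean: ({:.1f}, {:.1f})'.format(lower, upper))
```

Since followers_count is heavily skewed by a few very large accounts, this interval is only a rough approximation; a bootstrap or a log transform may be worth considering for data like this.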