# [Pandas](https://pandas.pydata.org/)
According to the pandas website, pandas helps fill the gap between data munging and preparation and data analysis, "enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R."

As in R, the main data structure in Pandas is the data frame. We'll start a new notebook to experiment and explore Pandas further.

In [1]:
import pandas as pd # for data frames, reading and writing data
from matplotlib import pyplot as plt
import psycopg2 # for connecting to a postgres database
import numpy as np # using this to create a range of floats
from math import sqrt

# the next line makes the matplotlib plots show up in the notebook cell
%matplotlib inline

## Pandas Basics
* Create a data frame from scratch
* Adding/removing columns
* Descriptive information

Unlike NumPy arrays and matrices, a Pandas data frame can hold different data types. However, a data frame is made up of `pandas.Series` objects (the columns), and each Series holds a single data type. Let's create a dataframe from scratch with a few columns of different data types.

*Create a data frame with three columns:*
* 'numbers' (integers)
* 'floats' (floats)
* 'names' (strings)

In [2]:
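# use pd.DataFrame to create the data frame.
# You can create the data fields in a dictionary beforehand, or directly in the call to pd.DataFrame
data = {'numbers': [10, 20, 30, 40],
        'floats': [1.5, 2.5, 3.5, 4.5],
        'names': ['Yves', 'Guido', 'Felix', 'Francesc']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])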

# View The resulting dataframe
df

Unnamed: 0,numbers,floats,names
a,10,1.5,Yves
b,20,2.5,Guido
c,30,3.5,Felix
d,40,4.5,Francesc
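
To see the one-type-per-column rule in action, here is a minimal sketch that rebuilds the same small frame under the stand-in name `demo_df` (a name chosen for this example, not part of the notebook) and inspects the data type of each column.

```python
import pandas as pd

# demo_df is a stand-in name so the notebook's own df is left untouched
demo_df = pd.DataFrame(
    {'numbers': [10, 20, 30, 40],
     'floats': [1.5, 2.5, 3.5, 4.5],
     'names': ['Yves', 'Guido', 'Felix', 'Francesc']},
    index=['a', 'b', 'c', 'd'])

# Each column is a pandas Series with its own single dtype
column_types = demo_df.dtypes
print(column_types)

# Pulling out one column returns a Series
numbers_col = demo_df['numbers']
print(type(numbers_col))
```

The frame as a whole mixes types, but `dtypes` reports exactly one type per column.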


Add a calculated column:

In [3]:
# Add a calculated column as the product of the numbers and floats:
df['calc_col'] = df['numbers'] * df['floats']

#View the resulting dataframe
df

Unnamed: 0,numbers,floats,names,calc_col
a,10,1.5,Yves,15.0
b,20,2.5,Guido,50.0
c,30,3.5,Felix,105.0
d,40,4.5,Francesc,180.0


Let's add some missing data so that we can look at how pandas treats it and how to find it when loading data sets later.

In [4]:
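# Create a new row as a dictionary and use np.nan for missing values.
new_row = {'numbers': np.nan, 'floats': np.nan, 'names': 'Mario', 'calc_col': np.nan}
# Append the new row to our data frame. DataFrame.append was removed in pandas 2.0,
# so this uses pd.concat with ignore_index=True instead.
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
# Reset the index to the names column after appending (drop=False keeps the column too)
df = df.set_index('names', drop=False)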

#View the resulting dataframe
df

Unnamed: 0_level_0,numbers,floats,names,calc_col
names,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Yves,10.0,1.5,Yves,15.0
Guido,20.0,2.5,Guido,50.0
Felix,30.0,3.5,Felix,105.0
Francesc,40.0,4.5,Francesc,180.0
Mario,,,Mario,


### Data Frame Descriptive Info
* Column names
* Length
* Missing data

We can use the `describe` method on a data frame to get some basic statistics on the columns. The default is to include only numerical columns. Try it with and without `include='all'` to see the different versions of the results.

In [5]:
df.describe()

Unnamed: 0,numbers,floats,calc_col
count,4.0,4.0,4.0
mean,25.0,3.0,87.5
std,12.909944,1.290994,71.937473
min,10.0,1.5,15.0
25%,17.5,2.25,41.25
50%,25.0,3.0,77.5
75%,32.5,3.75,123.75
max,40.0,4.5,180.0


In [6]:
df.describe(include='all')

Unnamed: 0,numbers,floats,names,calc_col
count,4.0,4.0,5,4.0
unique,,,5,
top,,,Mario,
freq,,,1,
mean,25.0,3.0,,87.5
std,12.909944,1.290994,,71.937473
min,10.0,1.5,,15.0
25%,17.5,2.25,,41.25
50%,25.0,3.0,,77.5
75%,32.5,3.75,,123.75


In [7]:
print("Numbers:\nmean: {:.2f}\nstd: {:.2f}".format(df.numbers.mean(),
                                                   df.numbers.std()))

Numbers:
mean: 25.00
std: 12.91
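
The column names, length and missing-data items from the list above can be checked directly. The sketch below uses a small stand-in frame called `info_df` (a made-up name and made-up values for this example) so it runs on its own.

```python
import numpy as np
import pandas as pd

# info_df is a hypothetical stand-in for the small frame built earlier
info_df = pd.DataFrame({
    'numbers': [10, 20, np.nan],
    'floats': [1.5, 2.5, np.nan],
    'names': ['Yves', 'Guido', 'Mario']})

column_names = list(info_df.columns)        # column names
n_rows = len(info_df)                       # number of rows
missing_per_column = info_df.isna().sum()   # count of missing values per column
rows_with_missing = info_df[info_df.isna().any(axis=1)]  # rows containing any NaN

print(column_names, n_rows)
print(missing_per_column)
print(rows_with_missing)
```

`isna().sum()` is a quick first check after loading a new data set, since `describe` only hints at missing values through a lower `count`.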


## Day-to-Day Pandas
* Reading/Writing files - xlsx, csv
* Subsetting and merging data frames
* Plotting

## Load Data from a Database
We'll use some data from the Twitter work we've been doing for most of the analysis. I'll start by pulling data from our Postgres database on AWS. You cannot connect to this database without someone adding your IP address to the security group on AWS, but this will show you how to pull data from a database.

I'll pull 200 tweets from each topic to get a good mix. I'll pull each into a pandas DataFrame and merge them all together. Finally, I'll save them to Excel for the sample data for you to use. 

BEWARE - when pulling twitter ids (or any very large integers) into Excel, Excel rounds them to 15 significant digits, losing the trailing digits and making joins and merges break.

NOTE: I'm commenting all of this database stuff out, since you won't be able to connect anyway. Leaving the cells for my reference and yours, so you can see how database connections work with pandas.

In [8]:
# # Database connection parameters
# hostname = 'ditwitter.c6rgtnn1vfuu.us-east-1.rds.amazonaws.com'
# username = 'ditwitter_sa'
# pwd = 'ThriventTwitter'
# database = 'ditwitter'

# # Connect
# conn = psycopg2.connect( host=hostname, user=username, password=pwd, dbname=database )

In [9]:
# # First let's get a list of topics:
# SQL = """SELECT DISTINCT topics.* 
#         FROM topics
#         INNER JOIN models ON md_tp_id = tp_id"""

# topics_df = pd.read_sql(SQL, con=conn)
# topics_df

### Tweet Data

In [10]:
# # Create an empty data frame to hold the tweets we're going to collect
# tweets_df = pd.DataFrame()

# # Loop through all the active topics and grab a block of tweets, then merge with the tweets_df
# block_size = 200

# for tp_id in topics_df['tp_id']:
#     SQL = """SELECT t.*, tp_name 
#     FROM tweets t
#     INNER JOIN tweet_scores ts ON ts.ts_tweet_id = t.tweet_id
#     INNER JOIN models m ON m.md_id = ts.ts_md_id
#     INNER JOIN topics tp ON tp.tp_id = m.md_tp_id
#     WHERE tp_id = {}
#     LIMIT {}""".format(tp_id, block_size)
    
#     tweet_block = pd.read_sql(SQL, conn)
# #     print("pulled {} for topic_id: {}.".format(len(tweet_block), tp_id))
#     tweets_df = pd.concat([tweets_df, tweet_block])  # DataFrame.append was removed in pandas 2.0

# tweets_df.head()

### User Data
Let's pull the user data for all of these records. To do that, we'll need to build a "WHERE" clause that has all the unique user_ids from our tweets dataframe. We'll need to convert the values to strings, then separate them by commas. 

In [11]:
# # Build the list of unique user_ids
# sep = ','
# users_string = sep.join(tweets_df['user_id'].astype(str).unique())

# SQL = "SELECT * FROM users WHERE id in ({})".format(users_string)
# users_df = pd.read_sql(SQL, conn)
# print(len(users_df))
# users_df.head()
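
Formatting ids straight into the SQL string works for trusted integers, but query placeholders are a safer habit. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database with a made-up `users` table instead of the Postgres database, and the names `demo_conn`, `demo_tweets` and `demo_users` are hypothetical. With psycopg2 the placeholder would be `%s` rather than `?`.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a made-up users table (assumption for this demo)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE users (id INTEGER, screen_name TEXT)")
demo_conn.executemany("INSERT INTO users VALUES (?, ?)",
                      [(1, 'alice'), (2, 'bob'), (3, 'carol')])

# Hypothetical tweets frame with repeated user_ids
demo_tweets = pd.DataFrame({'user_id': [1, 3, 3, 1]})

# De-duplicate and convert numpy integers to plain Python ints
unique_ids = [int(x) for x in demo_tweets['user_id'].unique()]

# One placeholder per id; the values travel separately from the SQL text
placeholders = ','.join('?' for _ in unique_ids)
sql = "SELECT * FROM users WHERE id IN ({})".format(placeholders)

demo_users = pd.read_sql(sql, demo_conn, params=unique_ids)
print(demo_users)
demo_conn.close()
```

Only the placeholders are formatted into the query text; the id values themselves are passed through `params`.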

## Save to Excel
Now that I've pulled the data for the examples, I'll save it to Excel for easy distribution. This is where we have to work around Excel's rounding of large integers. We don't care about tweet_ids, but we DO care about user_ids, since that's the field that we'll later join these two datasets on. To avoid the rounding issue, we'll add a column to `tweets_df` that holds the user_id as a string. We already have this column in the user data as `id_str`.

*NOTE: I've commented out the save-to-excel code, since I've since added other data to the sample_data.xlsx file that I don't want overwritten. Leaving it in here for reference.*

In [12]:
# tweets_df['user_id_str'] = tweets_df['user_id'].astype(str)
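
To illustrate why the string copy matters, the sketch below pushes a made-up 19-digit id through a 64-bit float, which loses precision in a way analogous to Excel's 15-significant-digit limit. This is not Excel itself, and `big_id` and `ids` are hypothetical names.

```python
import pandas as pd

# Hypothetical 19-digit id. It is odd on purpose: odd integers above 2**53
# have no exact float64 representation.
big_id = 1025402184198967297
ids = pd.DataFrame({'user_id': [big_id]})

# Converting to float64 silently changes the value
as_float = ids['user_id'].astype('float64')
round_tripped = int(as_float.iloc[0])

# Converting to string keeps every digit
ids['user_id_str'] = ids['user_id'].astype(str)

print(round_tripped == big_id)
print(ids['user_id_str'].iloc[0])
```

Joining on the string column avoids the problem because the digits are never interpreted as a number.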

In [13]:
# Create a Pandas Excel writer using XlsxWriter as the engine.
# writer = pd.ExcelWriter('sample_data.xlsx', engine='xlsxwriter')

# # Write each dataframe to a different worksheet.
# tweets_df.to_excel(writer, sheet_name='tweet_data', index=False)
# users_df.to_excel(writer, sheet_name='user_data', index=False)
# writer.close()  # writer.save() was removed in pandas 2.0


### Read Data from Excel
I'm using pandas to read in the data file from Excel. If the file is located in the same directory as the notebook, this will work. Otherwise, add the path to the file to the filename. Pandas will automatically infer data types, column names and row numbers from the data. There are quite a few arguments that you can pass to this function to control what is loaded and how. Running `pd.read_excel?` in a notebook cell brings up the docstring, which explains all of the options.

In [14]:
filename = 'sample_data.xlsx'
t_data = pd.read_excel(filename, sheet_name='tweets_classified')
t_data.head()

Unnamed: 0,text,class,topic
0,You remind me of my BimmerSee your ignition b...,0,Birth
1,RT @JaDineNATION: Were so excited for our dese...,0,Birth
2,i always get super self conscious about keepin...,0,Birth
3,@Juhhhhhhnelle Weird ass bitch lucky Im pregna...,1,Birth
4,RT @thdmichaelbell: New Signage above our Fron...,0,Birth


### Export dataframe to tab delimited file
Now that we have some data to work with, we can export it to a tab-delimited file. After exporting, we'll remove the data frame and reload it from the exported file.
* setting the `sep` argument to '\t' makes it tab separated. The default is comma separated.
* setting `index=False` prevents it from writing out the row numbers as a column, which would create an extraneous column.

In [15]:
export_filename = 'sample_data.csv'
t_data.to_csv(export_filename, sep='\t', index=False)
t_data = None

In [16]:
t_data = pd.read_csv(export_filename, sep='\t')
t_data.head()

Unnamed: 0,text,class,topic
0,You remind me of my BimmerSee your ignition b...,0,Birth
1,RT @JaDineNATION: Were so excited for our dese...,0,Birth
2,i always get super self conscious about keepin...,0,Birth
3,@Juhhhhhhnelle Weird ass bitch lucky Im pregna...,1,Birth
4,RT @thdmichaelbell: New Signage above our Fron...,0,Birth


Let's get an idea of what's in this dataframe - I know it has texts from different topics. Let's see how many from each are in there:

In [17]:
t_data['topic'].value_counts()

Marriage      241
Moving        240
Graduation    226
Divorce       226
Birth         199
Name: topic, dtype: int64

In [18]:
t_data['class'].value_counts()

0    883
1    206
2     43
Name: class, dtype: int64

Since we want the `class` variable to be binary, we have some data clean-up to do here. At some point I started using 2 for negatives, since it was easier on the keyboard than 0! Let's replace all of those 2s with 0 to make class truly binary.

In [19]:
t_data.loc[t_data['class']==2,'class'] = 0 
t_data['class'].value_counts()

0    926
1    206
Name: class, dtype: int64

In [20]:
# Check data types
t_data.dtypes

text     object
class     int64
topic    object
dtype: object

In [21]:
# Get some descriptive data from this dataframe
t_data.describe(include='all')

Unnamed: 0,text,class,topic
count,1132,1132.0,1132
unique,1122,,5
top,Brilliant move by @Marvel / @Disney moving Inf...,,Marriage
freq,2,,241
mean,,0.181979,
std,,0.385998,
min,,0.0,
25%,,0.0,
50%,,0.0,
75%,,0.0,


### Subsetting
Subsetting dataframes with Pandas is very similar to subsetting in R. Since the sample data has data from 5 different topics, let's pull out two topics and make them separate data frames.

In Pandas, row subsetting is usually done through `loc` or `iloc`, with the subset parameters in square brackets.
* `loc` is used when you select by label or have criteria based on the values in one or more columns
* `iloc` will give you the values from a numeric position in the dataframe. For example, if you wanted the first 10 rows of the data frame, you'd do the following:

*NOTE: Unlike R, Python is zero-based, so lists and indexes start at zero, rather than one.*

#### Subsetting with `iloc`

In [22]:
# First 10 rows
t_data.iloc[0:10]

Unnamed: 0,text,class,topic
0,You remind me of my BimmerSee your ignition b...,0,Birth
1,RT @JaDineNATION: Were so excited for our dese...,0,Birth
2,i always get super self conscious about keepin...,0,Birth
3,@Juhhhhhhnelle Weird ass bitch lucky Im pregna...,1,Birth
4,RT @thdmichaelbell: New Signage above our Fron...,0,Birth
5,NEW SUPER BABY 2 SCAN AND TRANSLATION AND NEW ...,0,Birth
6,RT @Prof_Hariom: @OmarAbdullah Throw out of Bh...,0,Birth
7,RT @dre85567034: Mmm...Busty Preggos Yummy! Pa...,0,Birth
8,baby girl you re a star,0,Birth
9,RT @DDuaneOfficial: Glad I was able to speak w...,0,Birth
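
The difference between the two accessors is easiest to see when the index labels are not simply 0, 1, 2 and so on. The following sketch uses a made-up frame called `demo`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
demo = pd.DataFrame({'value': [100, 200, 300, 400]}, index=[10, 20, 30, 40])

by_position = demo.iloc[0:2]             # positions 0 and 1; the end position is excluded
by_label = demo.loc[10:30]               # labels 10 through 30; the end label is included
by_mask = demo.loc[demo['value'] > 250]  # criteria based on the values in a column

print(by_position)
print(by_label)
print(by_mask)
```

Note that slicing with `iloc` follows Python's usual end-exclusive convention, while slicing with `loc` includes the end label.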


#### Subsetting with `loc`
When referencing columns in pandas, you can use either `dataframe.column_name` or `dataframe['column_name']`. They usually work the same way, but the attribute form fails when the column name is not a valid Python identifier (for example, it contains a space), is a reserved word, or clashes with an existing DataFrame attribute or method. `['column_name']` always works. In this dataframe, I had this issue with the 'class' column, because `class` is a Python keyword.

In [23]:
moving_df = t_data.loc[t_data.topic=='Moving']
moving_df.head()

Unnamed: 0,text,class,topic
892,Peaches is moving to the City where lots of ex...,0,Moving
893,The best thing we can realistically hope for i...,0,Moving
894,@Iam100Savage A Fund was moving out and hence ...,0,Moving
895,"You could win $4,000 towards a new home theate...",0,Moving
896,"If I wasn’t in a relationship, after graduatio...",0,Moving


In [24]:
marriage_df = t_data.loc[t_data['topic']=='Marriage']
marriage_df.head()

Unnamed: 0,text,class,topic
651,2 years ago i casually pinned wedding ideas da...,0,Marriage
652,so completely honored to stand next to this ma...,0,Marriage
653,the wedding chapel http://t.co/xthvajo4,0,Marriage
654,i love that my modern-day beauty &amp; the bea...,0,Marriage
655,on the train heading to pa for a friends weddi...,0,Marriage


#### Multiple subset criteria
This works the same way that subsetting in R does. Let's find all of the Marriage and Moving tweets where `class == 1`. A common error when subsetting is: `The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().` If you get that, make sure you are using `&` and `|` for the and/or operators rather than `and`/`or`. If it's still an issue, check the parentheses: `&` and `|` bind more tightly than comparisons such as `==`, so each comparison needs its own pair of parentheses.

In [25]:
subset2 = t_data.loc[((t_data['topic']=='Marriage') & (t_data['class']==1)) |
                     ((t_data['topic']=='Moving') & (t_data['class']==1))]
subset2.head()

Unnamed: 0,text,class,topic
656,houston...i love you and i hate you right now....,1,Marriage
663,chaelisa for we got married,1,Marriage
664,here is how i feel about wedding thank you car...,1,Marriage
670,for all interested here is the video from our...,1,Marriage
682,thankful for an aunt who can do a few last min...,1,Marriage
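
When several values of the same column are acceptable, `isin` can shorten the condition. The sketch below compares both forms on a made-up miniature of the classified tweets (`demo` is a hypothetical name).

```python
import pandas as pd

# Hypothetical miniature of the classified tweets
demo = pd.DataFrame({
    'topic': ['Marriage', 'Moving', 'Birth', 'Marriage', 'Moving'],
    'class': [1, 0, 1, 0, 1]})

# Two AND-ed conditions combined with OR, as in the cell above
verbose = demo.loc[((demo['topic'] == 'Marriage') & (demo['class'] == 1)) |
                   ((demo['topic'] == 'Moving') & (demo['class'] == 1))]

# isin collapses the topic test into a single condition
compact = demo.loc[demo['topic'].isin(['Marriage', 'Moving']) & (demo['class'] == 1)]

print(compact)
```

Both selections return the same rows; the `isin` form just needs fewer parentheses.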


### Merging Data Frames
Now that we have two separate data frames for Marriage and Moving, let's merge them together and see if the number of class==1 matches our subset above. There are a lot of options when merging data frames - similar to joins with data tables. The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) is pretty helpful as is StackOverflow, of course.

This first example just combines two dataframes. No key is specified, so `merge` matches on all of the columns that the two frames share.

In [26]:
merged_df = marriage_df.merge(moving_df, how='outer')
print('Marriage data frame: {}'.format(len(marriage_df)))
print('Moving data frame: {}'.format(len(moving_df)))
print('Merged data frame: {}'.format(len(merged_df)))

Marriage data frame: 241
Moving data frame: 240
Merged data frame: 481


In [27]:
# Does our count of positive tweets match between the subset and the merged data?
len(subset2) == len(merged_df.loc[merged_df['class']==1])

True
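
When the goal is only to stack two frames with the same columns, `pd.concat` is a common alternative to an outer merge, since it appends rows without trying to match anything. A minimal sketch with made-up frames (`marriage_demo` and `moving_demo` are hypothetical names):

```python
import pandas as pd

# Two small frames with identical columns
marriage_demo = pd.DataFrame({'text': ['a', 'b'], 'class': [0, 1], 'topic': 'Marriage'})
moving_demo = pd.DataFrame({'text': ['c'], 'class': [1], 'topic': 'Moving'})

# Stack the rows; ignore_index=True renumbers the result from 0
stacked = pd.concat([marriage_demo, moving_demo], ignore_index=True)

print(len(stacked))
print(stacked)
```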

We need some different data to show how to merge on different keys. Our sample file has data for this too in the `tweet_data` and `user_data` sheets.

In [28]:
tweet_df = pd.read_excel(filename, sheet_name='tweet_data')
tweet_df.head()

Unnamed: 0,tweet_id,id_str,created_at,text,user_id,favorite_count,favorited,in_reply_to_status_id,in_reply_to_user_id,lang,place,retweet_count,retweeted,tp_name,user_id_str
0,1025402184198967040,1025402184198967296,2018-08-03 15:25:06,RT @NdaliOzegbe: Imagine going to school grad...,327673809,0,False,,,en,,0,False,Graduation,327673809
1,1025402408992734976,1025402408992735233,2018-08-03 15:26:00,RT @unclenick_00: IM KEEPING THIS SAME ENERGY ...,578247241,0,False,,,en,,0,False,Graduation,578247241
2,1025402986762256000,1025402986762256384,2018-08-03 15:28:17,RT @dollyslibrary: If your little one is gradu...,108829376,0,False,,,en,,0,False,Graduation,108829376
3,1025403397468561024,1025403397468561410,2018-08-03 15:29:55,#IRememberATime when I thought graduating coll...,1016737236497457024,0,False,,,en,,0,False,Graduation,1016737236497457152
4,1025403499750859008,1025403499750858757,2018-08-03 15:30:20,Class of 2017 have you secured your spot on y...,895222808707551232,0,False,,,en,,0,False,Graduation,895222808707551232


In [29]:
user_df = pd.read_excel(filename, sheet_name='user_data')
user_df.head()

Unnamed: 0,id,id_str,name,screen_name,location,followers_count,friends_count,favourites_count,description,geo_enabled,...,statuses_count,time_zone,created_at,verified,utc_offset,contributors_enabled,listed_count,protected,url,state
0,8192222,8192222,Jezebel,Jezebel,,318516,29,94,All the news you need. Without airbrushing.,0,...,81056,Eastern Time (US & Canada),2007-08-14 22:57:34,1,-14400.0,0,6709,0,http://jezebel.com,
1,11801852,11801852,Jenna Hatfield,JennaHatfield,Cambridge OH,10024,6808,21156,Award winning writer. Editor. Wife. Mom. Dog l...,0,...,106425,Eastern Time (US & Canada),2008-01-03 15:51:49,0,-14400.0,0,577,0,http://stopdropandblog.com,Ohio
2,12366342,12366342,King County Library,KCLS,King County WA,10442,235,4417,King County Library System (KCLS) is your comm...,1,...,17786,Pacific Time (US & Canada),2008-01-17 17:51:28,0,-28800.0,0,548,0,http://www.kcls.org,Washington
3,14362996,14362996,Alanna Banks,fridaysoffshop,Toronto,765,850,32,Shop Owner at fridaysoff.ca an online source o...,0,...,1946,Quito,2008-04-11 17:49:07,0,-18000.0,0,48,0,http://fridaysoff.ca,
4,15430687,15430687,C.B. Cebulski,CBCebulski,Shanghai China,55961,869,11642,Just a guy lucky enough to work for Marvel. Tr...,0,...,22157,Eastern Time (US & Canada),2008-07-14 19:04:35,1,-14400.0,0,1889,0,http://www.eataku.tumblr.com,


Now that we have a handful of tweets, we want to merge the tweet data with the user data to append specific user columns to the tweet data. Let's only grab a few columns from each data frame to keep it easy to read. We can select a subset of columns with no other criteria with `dataframe[[list of columns]]`.

The `how` parameter of the merge works like a join type, defining which rows to keep when there isn't a match in both dataframes. It defaults to an inner join. In this case I want to keep all of the tweets, even if we don't have a user record, so I'm using `how='left'` since the first table in the merge (the left one) is the tweet_df.

NOTE: join on tweet_df.user_id_str = user_df.id_str to avoid any truncation of the long integers that may have happened in exporting to Excel!

### Merge using a unique match key.

In [30]:
merged_tweets = tweet_df[['tweet_id', 'created_at', 'user_id_str', 'text']].merge(
    user_df[['id_str', 'name','screen_name', 'followers_count']],
    left_on='user_id_str',
    right_on='id_str',
    how='left')

merged_tweets.head()

Unnamed: 0,tweet_id,created_at,user_id_str,text,id_str,name,screen_name,followers_count
0,1025402184198967040,2018-08-03 15:25:06,327673809,RT @NdaliOzegbe: Imagine going to school grad...,327673809,Ya Girl,bellabiceps,1019
1,1025402408992734976,2018-08-03 15:26:00,578247241,RT @unclenick_00: IM KEEPING THIS SAME ENERGY ...,578247241,Big Tit Energy,theCoolestLame3,6889
2,1025402986762256000,2018-08-03 15:28:17,108829376,RT @dollyslibrary: If your little one is gradu...,108829376,Kat Dickinson,DickinsonKat,76
3,1025403397468561024,2018-08-03 15:29:55,1016737236497457152,#IRememberATime when I thought graduating coll...,1016737236497457152,kandidklerity,kandidklerity,2
4,1025403499750859008,2018-08-03 15:30:20,895222808707551232,Class of 2017 have you secured your spot on y...,895222808707551232,VDS Training,trainingVDS,639
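
With a left join it can be useful to know how many rows found no match. The `indicator=True` option of `merge` adds a `_merge` column for that. The sketch below uses made-up frames (`tweets_demo`, `users_demo`) rather than the sample data.

```python
import pandas as pd

# Hypothetical data: the tweet from user '9' has no user record
tweets_demo = pd.DataFrame({'user_id_str': ['1', '2', '9'], 'text': ['x', 'y', 'z']})
users_demo = pd.DataFrame({'id_str': ['1', '2'], 'screen_name': ['alice', 'bob']})

checked = tweets_demo.merge(users_demo,
                            left_on='user_id_str',
                            right_on='id_str',
                            how='left',
                            indicator=True)

# _merge is 'both' for matched rows and 'left_only' for tweets without a user
match_counts = checked['_merge'].value_counts()
unmatched = checked.loc[checked['_merge'] == 'left_only']

print(match_counts)
print(unmatched)
```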


### Bin a continuous variable into a new variable.
Since we have a bunch of users, let's bin their followers_count into equal-width bins. This was a new one for me, but there is a handy pandas function for it, similar to R, called [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html).

In [31]:
merged_tweets['binned_followers'] = pd.cut(
    merged_tweets['followers_count'],
    bins=5,
    labels=['very_low', 'low', 'medium', 'high', 'very_high'])
merged_tweets['binned_followers'].value_counts()

very_low     794
very_high      2
high           2
low            2
medium         0
Name: binned_followers, dtype: int64

### Binning with an equal number of members
Cutting the followers_count into equal-width bins wasn't very helpful, since there are two users with so many followers that the ranges stop being useful.

More useful may be to use quantiles. With five bins these are technically quintiles, although the column below keeps the name `quartile_followers`. For that, we'll have to calculate the cutoffs ahead of time, then pass them into the `cut` function as the `sequence of scalars` for its `bins` argument.

In [32]:
bins = 5
# use the Series.quantile method and np.linspace to generate the cutoff values for the cut function.
cutoffs = list(merged_tweets['followers_count'].quantile(np.linspace(0,1,bins+1)))

# create some labels for our new, binned column
labels = ['Q'+str(x) for x in range(1,bins+1)]

# cut the data based on the cutoffs
merged_tweets['quartile_followers'] = pd.cut(merged_tweets['followers_count'], cutoffs, labels = labels)

# check if it worked
merged_tweets['quartile_followers'].value_counts()

Q5    160
Q4    160
Q3    160
Q2    159
Q1    157
Name: quartile_followers, dtype: int64

Let's see how that worked out. We'll look at the mean, median and standard deviation of `followers_count` for each bin. We can use the groupby function in pandas to get these aggregates.

In [42]:
merged_tweets.groupby('quartile_followers')['followers_count'].agg(['count','min','max','mean', 'median', 'std'])

Unnamed: 0_level_0,count,min,max,mean,median,std
quartile_followers,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Q1,157,2,110,48.624204,46.0,31.56138
Q2,159,111,335,213.628931,206.0,67.487416
Q3,160,339,663,499.0875,491.5,99.045786
Q4,160,666,1509,978.85,924.5,232.875225
Q5,160,1527,318516,15506.7375,2819.0,45246.019754


### Macro variable for data selection
From Corinne: *In SAS we frequently create a list of variables by putting them in a macro variable that we use for data exploration and variable selection so that we can perform the necessary tasks for all variables easily.*

Using the user_df data frame, let's pick a subset of data for our macro variable. The macro variable will be a list with the column names we want.

In [34]:
# Show the list of all columns
list(user_df.columns)

['id',
 'id_str',
 'name',
 'screen_name',
 'location',
 'followers_count',
 'friends_count',
 'favourites_count',
 'description',
 'geo_enabled',
 'lang',
 'statuses_count',
 'time_zone',
 'created_at',
 'verified',
 'utc_offset',
 'contributors_enabled',
 'listed_count',
 'protected',
 'url',
 'state']

In [35]:
# Create our subset variable - let's pick all the numerical fields
col_subset = [
    'followers_count',
    'friends_count',
    'favourites_count',
    'statuses_count',
    'listed_count']

# Now we can use this variable to select from the data:
user_df[col_subset].head()

Unnamed: 0,followers_count,friends_count,favourites_count,statuses_count,listed_count
0,318516,29,94,81056,6709
1,10024,6808,21156,106425,577
2,10442,235,4417,17786,548
3,765,850,32,1946,48
4,55961,869,11642,22157,1889


### Crosstab, Pivots and GroupBy
Back to our classified tweet data, we have several different topics and a binary classification. We can create a two-way frequency table showing the number of each class in each topic. We can use pandas.crosstab to get to this result.

#### Crosstab

In [36]:
two_way = pd.crosstab(t_data['topic'],t_data['class'])
two_way

class,0,1
topic,Unnamed: 1_level_1,Unnamed: 2_level_1
Birth,172,27
Divorce,213,13
Graduation,149,77
Marriage,196,45
Moving,196,44
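
`crosstab` can also add totals or convert the counts to proportions. The sketch below demonstrates the `margins` and `normalize` options on a made-up frame (`demo` is a hypothetical name).

```python
import pandas as pd

# Hypothetical miniature of the classified tweets
demo = pd.DataFrame({
    'topic': ['Birth', 'Birth', 'Moving', 'Moving', 'Moving', 'Moving'],
    'class': [0, 1, 0, 0, 0, 1]})

# margins=True appends an 'All' row and column with the totals
counts = pd.crosstab(demo['topic'], demo['class'], margins=True)

# normalize='index' turns each row into proportions that sum to 1
shares = pd.crosstab(demo['topic'], demo['class'], normalize='index')

print(counts)
print(shares)
```

Row proportions make topics of different sizes easier to compare than raw counts.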


#### Pivot Table
You can also use the pivot_table function to get to the same results.

In [37]:
two_way_pivot = t_data.pivot_table(index='topic', columns = 'class', aggfunc=len)

Both of these results have `topic` as the row index and `class` on the columns, so subsetting with `loc` takes two parts: a selection for the rows and a selection for the columns.

In [38]:
two_way.loc[(['Marriage','Moving'],[0,1])]

class,0,1
topic,Unnamed: 1_level_1,Unnamed: 2_level_1
Marriage,196,45
Moving,196,44


In [39]:
two_way_pivot.loc[['Marriage','Moving']]

Unnamed: 0_level_0,text,text
class,0,1
topic,Unnamed: 1_level_2,Unnamed: 2_level_2
Marriage,196,45
Moving,196,44


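The pivot version is more difficult to subset this way. Because no `values` column was specified, its columns have two levels (`text`, then `class`), so a plain `[0, 1]` does not work as the second part of the selection.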
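
A sketch of two ways to handle this, using a made-up frame (`demo` is a hypothetical name): either address the two-level columns with a tuple, or name the `values` column so the pivot comes back with a single column level.

```python
import pandas as pd

# Hypothetical miniature of the classified tweets
demo = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
    'class': [0, 1, 0, 1, 0, 1],
    'topic': ['Marriage', 'Marriage', 'Moving', 'Moving', 'Moving', 'Birth']})

# No values given: columns get two levels, ('text', 0) and ('text', 1)
nested = demo.pivot_table(index='topic', columns='class', aggfunc=len)
positives = nested.loc[['Marriage', 'Moving'], ('text', 1)]

# Naming the values column keeps a single column level (0 and 1)
flat = demo.pivot_table(index='topic', columns='class', values='text', aggfunc='count')
flat_subset = flat.loc[['Marriage', 'Moving'], [0, 1]]

print(positives)
print(flat_subset)
```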

Let's revisit our GroupBy table from the previous section and look at a pivot version of it. Here we are looking to get aggregate data for the different bins that we created based on `followers_count`:

In [43]:
grouped_df = merged_tweets.groupby('quartile_followers')['followers_count'].agg(['count','min','max','mean', 'median', 'std'])
grouped_df

Unnamed: 0_level_0,count,min,max,mean,median,std
quartile_followers,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Q1,157,2,110,48.624204,46.0,31.56138
Q2,159,111,335,213.628931,206.0,67.487416
Q3,160,339,663,499.0875,491.5,99.045786
Q4,160,666,1509,978.85,924.5,232.875225
Q5,160,1527,318516,15506.7375,2819.0,45246.019754


### More Pivoting
Similar to pivot tables in Excel, Pandas creates a hierarchy based on the pivot values and then applies one or multiple aggregate functions to the values not included in the index.

In [103]:
pivot_data = user_df[['followers_count', 'friends_count', 'favourites_count', 'lang', 'statuses_count', 'time_zone', 'state']]
pivoted = pivot_data.pivot_table(index = ['time_zone', 'state'], aggfunc=['count','mean'])
pivoted.columns

MultiIndex(levels=[['count', 'mean'], ['favourites_count', 'followers_count', 'friends_count', 'lang', 'statuses_count']],
           labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 4]])

Selecting data from a dataframe with a multi-level index is more complicated than a simple data frame. This [reference](https://pandas.pydata.org/pandas-docs/stable/advanced.html) can help. Let's select two timezones from the pivot table that we just created and the mean of two of the columns.

In [106]:
pivoted.loc[(['Eastern Time (US & Canada)','Central Time (US & Canada)'], 'mean')][['friends_count', 'favourites_count']]

Unnamed: 0_level_0,Unnamed: 1_level_0,friends_count,favourites_count
time_zone,state,Unnamed: 2_level_1,Unnamed: 3_level_1
Central Time (US & Canada),Arkansas,438.0,3257.0
Central Time (US & Canada),Florida,1977.0,5221.0
Central Time (US & Canada),Iowa,37.0,401.0
Central Time (US & Canada),Kansas,287.0,405.0
Central Time (US & Canada),Minnesota,0.0,0.0
Central Time (US & Canada),Missouri,403.0,6106.0
Central Time (US & Canada),Nebraska,666.0,24920.0
Central Time (US & Canada),North Dakota,551.0,4554.0
Central Time (US & Canada),South Dakota,115.0,5098.0
Central Time (US & Canada),Texas,1320.666667,5337.0


In [116]:
pivoted2 = pivot_data.pivot_table(index = ['time_zone'], aggfunc=['count','mean'])
pivoted2.loc[['Eastern Time (US & Canada)','Central Time (US & Canada)'], 'mean'][['friends_count', 'favourites_count']]

Unnamed: 0_level_0,friends_count,favourites_count
time_zone,Unnamed: 1_level_1,Unnamed: 2_level_1
Eastern Time (US & Canada),990.727273,21605.227273
Central Time (US & Canada),656.2,17371.36


### How to generate a summary of interval/continuous/numeric variables including
* Basic statistics like the mean, median, percentiles, standard deviation, etc.
* Confidence intervals around the mean

A quick way to get to some of this information is with the `describe` function on a dataframe. By default, this will only describe numeric variables:

In [93]:
merged_tweets.describe()

Unnamed: 0,tweet_id,user_id_str,id_str,followers_count
count,800.0,800.0,800.0,800.0
mean,1.002401e+18,2.188585e+17,2.188585e+17,3448.93625
std,2.526354e+16,3.834958e+17,3.834958e+17,21068.881171
min,9.712287e+17,8192222.0,8192222.0,0.0
25%,9.721836e+17,314186500.0,314186500.0,156.25
50%,1.015113e+18,1467469000.0,1467469000.0,491.5
75%,1.025432e+18,4914693000.0,4914693000.0,1134.5
max,1.026743e+18,1.026092e+18,1.026092e+18,318516.0


If we add `include='all'` we'll get descriptive data on the rest of the columns, with a bunch of 'NaN' for irrelevant statistics.

In [94]:
merged_tweets.describe(include='all')

Unnamed: 0,tweet_id,created_at,user_id_str,text,id_str,name,screen_name,followers_count,binned_followers,quartile_followers
count,800.0,800,800.0,800,800.0,782.0,800,800.0,800,796
unique,,704,,699,,689.0,724,,4,5
top,,2018-08-03 15:52:45,,RT @MooseWD: Quick update from me: I found the...,,,xosj_,,very_low,Q5
freq,,4,,8,,7.0,2,,794,160
first,,2018-03-07 03:39:12,,,,,,,,
last,,2018-08-07 08:13:59,,,,,,,,
mean,1.002401e+18,,2.188585e+17,,2.188585e+17,,,3448.93625,,
std,2.526354e+16,,3.834958e+17,,3.834958e+17,,,21068.881171,,
min,9.712287e+17,,8192222.0,,8192222.0,,,0.0,,
25%,9.721836e+17,,314186500.0,,314186500.0,,,156.25,,


Since this gives you the standard deviation for numeric fields, you can use that to create confidence intervals as needed. Here's how you can pull values out of the result. Let's say we want the standard deviation for the followers_count:

In [95]:
# set the description to a data frame variable, then pull the value as a subset
desc = merged_tweets.describe(include='all')
desc.loc['std','followers_count']

21068.88117115957

In [96]:
# or call the function and pull the value directly from the results, if there's no other need for that data.
merged_tweets.describe().loc['std','followers_count']

21068.88117115957

In [97]:
# Double checking that it's right using the standard deviation function (std)
merged_tweets['followers_count'].std()

21068.88117115957
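
To turn the standard deviation into a confidence interval around the mean, one common approach is the normal approximation: mean ± z × std / sqrt(n). The sketch below assumes a 95% level (z = 1.96) and uses made-up counts; for small or heavily skewed samples such as follower counts, treat the result as a rough guide only.

```python
from math import sqrt
import pandas as pd

# Made-up sample standing in for a numeric column
counts = pd.Series([120, 340, 95, 410, 230, 180, 275, 60])

n = counts.count()       # number of non-missing values
mean = counts.mean()
std = counts.std()       # sample standard deviation (ddof=1), as reported by describe()

# Assumption: normal approximation with z = 1.96 for a 95% interval
z = 1.96
std_err = std / sqrt(n)
ci_low = mean - z * std_err
ci_high = mean + z * std_err

print("mean: {:.2f}, 95% CI: ({:.2f}, {:.2f})".format(mean, ci_low, ci_high))
```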