# SI 618: Data Manipulation and Analysis
## 06 - Categorical Data & Text Processing 
### Pivoting, contingency tables, crosstabs, mosaic plots and chi-squared

### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>


## Overview for today
* Review HW1
* Project proposal review 
* Categorical Data: contingency tables, crosstabs, mosaic plots, chi-squared
* Text Processing: regular expressions

## Q0: What did you find confusing from last class?

Nothing really; it seemed pretty straightforward.

# Categorical Data

## Contingency tables, crosstabs, and chi-square

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

Let's generate a data frame to play with:

In [None]:
df = pd.DataFrame({'color' : ['red', 'green', 'green', 'black'] * 6,
                   'make' : ['ford', 'toyota', 'dodge'] * 8,
                   'vehicleClass' : ['suv', 'suv', 'suv', 'car', 'car', 'truck'] * 4})

In [None]:
df.head()

One of the most basic transformations we can do is a crosstab:

In [None]:
ct = pd.crosstab(df.color,df.vehicleClass)
ct
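A crosstab simply counts how often each pair of category values occurs together. As a minimal sketch on a small made-up frame (the `toy` name and its values are illustrative and not part of the course data), the same counts can be built by grouping on both columns and counting rows:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate what crosstab counts
toy = pd.DataFrame({'color': ['red', 'green', 'green', 'red', 'green'],
                    'vehicleClass': ['suv', 'suv', 'car', 'car', 'suv']})

# crosstab: one row per color, one column per vehicleClass, cells are counts
toy_ct = pd.crosstab(toy.color, toy.vehicleClass)

# The same counts via groupby: count rows per (color, vehicleClass) pair,
# then move vehicleClass from the index to the columns
toy_counts = toy.groupby(['color', 'vehicleClass']).size().unstack(fill_value=0)

print(toy_ct)
print(toy_counts)
```

The `fill_value=0` argument matters when some pair never occurs: without it, the missing cell would be NaN instead of a zero count.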

Notice how similar it is to pivoting.  In fact, go ahead and use ```pivot_table``` to do the same sort of transformation:

### <font color="magenta">Q1: Use ```pivot_table``` to create a DataFrame similar to the one from the ```crosstab``` above:</font>

In [None]:
# Add your code here

As usual, we would like to visualize our results:

In [None]:
import seaborn as sns

In [None]:
sns.heatmap(ct,annot=True)

### Titanic data

One of the more popular datasets that we use for experimenting with crosstabs is the 
survivor data from the Titanic disaster:

In [None]:
titanic = pd.read_csv('data/titanic.csv')

Let's create a crosstab of the data:

In [None]:
ct = pd.crosstab(titanic.passtype,titanic.status,margins=True)
ct

Now let's use our knowledge of data manipulation with pandas to generate some percentage totals:

### <font color="magenta">Q2: Generate this:</font>

![](assets/samplect.png)

In [None]:
# Add your code here

### <font color="magenta">Q3: Is this what we would have expected?</font>

In [None]:
# fill in the correct numbers on the next two lines (where np.nan is right now)
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
expectedAlive = ctExt.total * np.nan
expectedDead = ctExt.total * np.nan

### Now generate a similar matrix for the *expected* (as opposed to observed) values:

In [None]:
ctExpected = ct.copy()
ctExpected.alive = expectedAlive
ctExpected.dead = expectedDead
ctExpected['total'] = ctExpected.sum(axis=1)
ctExpected.loc['total'] = ctExpected.sum(axis=0)
alivePercent = np.round(ctExpected.alive/ctExpected.total * 100,decimals=2)
deadPercent = np.round(ctExpected.dead/ctExpected.total * 100,decimals=2)
totalPercent = np.round(ctExpected.total/ctExpected.total * 100,decimals=2)
detailExp = ctExpected.copy()
detailExp.alive = ctExpected.alive.astype('str') + " (" + alivePercent.astype('str') + "%)"
detailExp.dead = ctExpected.dead.astype('str') + " (" + deadPercent.astype('str') + "%)"
detailExp.total = ctExpected.total.astype('str') + " (" + totalPercent.astype('str') + "%)"

In [None]:
detailExp

In [None]:
detailCT

So, there we have the expected and observed values, along with their proportions.

In addition to the heatmap shown above, we can use a mosaic plot to visualize 
contingency tables:

In [None]:
from statsmodels.graphics.mosaicplot import mosaic
t = mosaic(titanic, ['passtype','status'],title='titanic survival')

In [None]:
# slightly easier to read
props = lambda key: {'color': 'r' if 'alive' in key else 'gray'}
t = mosaic(titanic, ['passtype','status'],title='titanic survival',properties=props)

Finally, we can go beyond visual exploration and apply analytic tests to see whether the 
observed values differ from the expected ones.  The chi-square test sums the squared differences
between the observed and expected values, each divided by the corresponding expected value.
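In symbols, the statistic is $\chi^2 = \sum (O - E)^2 / E$, where $O$ is an observed count and $E$ is the count expected if the two variables were independent. The sketch below computes it by hand for a made-up 2x2 table (the numbers are illustrative, not Titanic data) and compares the result with `scipy.stats.chi2_contingency`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts (rows: groups, columns: outcomes)
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Chi-square statistic: squared differences, each divided by its expected value
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

# correction=False turns off the Yates continuity correction that scipy
# applies to 2x2 tables by default, so the two numbers are comparable
chi2_scipy, p, dof, ex = chi2_contingency(observed, correction=False)
print(chi2_by_hand, chi2_scipy, dof)
```

The degrees of freedom are (rows - 1) * (columns - 1), which is 1 for a 2x2 table.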

## Let's talk about $\chi^2$

In [None]:
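from scipy.stats import chi2_contingency
# Test the observed counts only: ct was built with margins=True, and including
# the "All" row and column would inflate the degrees of freedom.
observed = pd.crosstab(titanic.passtype, titanic.status)
chi2, p, dof, ex = chi2_contingency(observed)
print("chi2 = ", chi2)
print("p-val = ", p)
print("degrees of freedom = ", dof)
print("Expected:")
pd.DataFrame(ex, index=observed.index, columns=observed.columns)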

## Let's apply these ideas to another dataset

For this component, we'll use the Comic Characters data set:

In [None]:
comic_characters = pd.read_csv("data/comic_characters.csv", index_col="id")
comic_characters.head(1)

### Example

We'd like to know how each publisher uses the different 'identity' types for its characters. Do DC characters with a public identity appear more often? What is the average number of appearances for Marvel characters whose identity is known to authorities?

In [None]:
comic_characters.groupby(['Identity','publisher'])['appearances'].mean().unstack().fillna(0)

Alternatively, we can use .pivot_table(). For example:

In [None]:
avg_appearance_per_identity = comic_characters.pivot_table(index='Identity', 
                                                          columns='publisher', 
                                                          values='appearances',
                                                          aggfunc='mean')
avg_appearance_per_identity.fillna(0).head()

For .pivot_table(), you will typically specify these four arguments:
1. index: the field that will become the index of the output table
2. columns: the field that will become the columns of the output table
3. values: the field to be aggregated/summarized
4. aggfunc: the aggregation function applied to values when more than one entry corresponds to an (index, column) pair, such as "mean", "count", or "max"
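To see why `aggfunc` matters, here is a minimal sketch on made-up data (the `toy` frame and its numbers are hypothetical; only the column names mirror the comic characters example). Two rows share the same (index, column) pair, so the aggregation function decides what ends up in that cell:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the comic characters example
toy = pd.DataFrame({'Identity': ['Secret', 'Secret', 'Public', 'Public', 'Secret'],
                    'publisher': ['Marvel', 'Marvel', 'Marvel', 'DC', 'DC'],
                    'appearances': [10, 30, 5, 8, 12]})

# Two Marvel characters share the (Secret, Marvel) pair, so aggfunc decides
# how their appearances are combined into a single cell
mean_table = toy.pivot_table(index='Identity', columns='publisher',
                             values='appearances', aggfunc='mean')
max_table = toy.pivot_table(index='Identity', columns='publisher',
                            values='appearances', aggfunc='max')

print(mean_table)
print(max_table)
```

The (Secret, Marvel) cell holds the average of 10 and 30 in the first table and the larger of the two in the second.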
    

### Let's warm up with a few groupby and pivot_table exercises:

### <font color="magenta">Q4: What is the total number of appearances of characters for each publisher?</font>

In [None]:
# Add your code here

### <font color="magenta">Q5: What is the total number of appearances of characters by each publisher in each year? Output a table.</font>

In [None]:
# Add your code here

### <font color="magenta">Q6: Construct a contingency table of sex and character alignment normalized by all values.</font>

Display the normalized values in percentage (%) format. Use brief sentences to explain your findings.  

Hint: use ```normalize='all'``` in your crosstab statement.  What does normalize do? (read the docs)

In [None]:
# Add your code here

### <font color="magenta">Q7: Create a mosaic plot of character alignment and alive status.</font>

In [None]:
# Add your code here

### <font color="magenta">Q8: Conduct a $\chi^2$ test of ```align``` and ```alive```. Please specify your (null and alternative) hypotheses and explain your findings.</font>

In [None]:
# Add your code here

# BREAK!

# Text Processing I: Basics and Regular Expressions

First, a slideshow.... 

As usual, let's load up some data:

In [None]:
import pandas as pd

In [None]:
reviews = pd.read_csv('data/amazon_food_reviews.zip')

Let's take a really small sample, just so we can experiment with the various string operations:

In [None]:
reviews_sample = reviews.head(10)

In [None]:
reviews_sample

Let's review some basic string functionality from Pandas that can be applied to any Series or Index:

In [None]:
reviews_sample.ProfileName.str.lower()

In [None]:
reviews_sample.ProfileName.str.upper()

In [None]:
reviews_sample.Summary.str.len()

Remember, the ```columns``` attribute of a DataFrame is an Index object, which means that we can use str operators on the column names:

In [None]:
reviews_sample.columns

In [None]:
reviews_sample.columns.str.lower()

Notice that the "User Id" column of the dataframe looks weird:  it has a space in the middle *and* at the end.  Columns that are named like that will invariably trip us up in downstream (i.e. later) analyses, so it's wise to correct them now.  Something like the following can help:

In [None]:
reviews_sample.columns.str.strip().str.lower().str.replace(' ','_')

And we can assign that back to the columns attribute to actually rename the columns:


In [None]:
reviews_sample.columns = reviews_sample.columns.str.strip().str.lower().str.replace(' ','_')

In [None]:
reviews_sample

### Splitting and Replacing Strings

Sometimes, we want to split strings into lists.  For example, we can split the "productid" column on the substring "00":

In [None]:
reviews_sample.productid.str.split('00')

In [None]:
reviews_sample.productid.str.split('00').str.get(1)

Equivalently:

In [None]:
reviews_sample.productid.str.split('00').str[1]
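As a small illustration of the pieces involved, the sketch below uses a made-up Series of identifiers (the `ids` name and values are hypothetical). It shows the list-valued result of `str.split`, indexing into those lists, and the `expand=True` option, which returns one column per piece instead of a list:

```python
import pandas as pd

# Hypothetical identifiers, used only to illustrate str.split
ids = pd.Series(['A-001-x', 'B-002-y'])

parts = ids.str.split('-')               # Series whose values are lists
second = ids.str.split('-').str[1]       # second element of each list
wide = ids.str.split('-', expand=True)   # DataFrame with one column per piece

print(parts)
print(second)
print(wide)
```

`expand=True` is often more convenient when every string splits into the same number of pieces and each piece should become its own column.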

### Replace (regex time!)

In [None]:
reviews_sample.summary.str.lower().str.replace('dog','health')

In [None]:
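# regex=True is required for "|" to mean "or"; since pandas 2.0 the default is a literal match
reviews_sample.summary.str.lower().str.replace('dog|taffy','health', regex=True)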
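Whether `|` is treated as "or" depends on the `regex` argument of `str.replace`. The following sketch, on a made-up Series (the `s` values are hypothetical), contrasts a literal replacement with a regular-expression one, and also shows case-insensitive matching via `flags` as an alternative to lowercasing first:

```python
import re
import pandas as pd

# Hypothetical review summaries, used only for illustration
s = pd.Series(['Good dog food', 'Great taffy', 'DOG treats'])

# regex=False: the pattern is matched literally, so "dog|taffy" is never found
literal = s.str.lower().str.replace('dog|taffy', 'health', regex=False)

# regex=True: "|" means "or", so either word is replaced
either = s.str.lower().str.replace('dog|taffy', 'health', regex=True)

# Case-insensitive matching without lowercasing the whole string first
keep_case = s.str.replace('dog|taffy', 'health', regex=True, flags=re.IGNORECASE)

print(literal.tolist())
print(either.tolist())
print(keep_case.tolist())
```

Passing `regex` explicitly is a reasonable habit, since its default changed between pandas versions.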

### Extracting Substrings

In [None]:
reviews_sample.summary.str.extract(r'(Dog)')

In [None]:
reviews_sample.summary.str.extract(r'(Dog|Taffy)')

In [None]:
reviews_sample.summary.str.extract(r'(Dog|[Tt]affy)')

In [None]:
# returns a Series
reviews_sample.summary.str.extract(r'(Dog|[Tt]affy)', expand = False)

In [None]:
reviews_sample.summary.str.extractall(r'(Dog|[Tt]affy)')

In [None]:
reviews_sample.summary.str.extractall(r'(as)')
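The difference between `extract` and `extractall` is easiest to see on a string with more than one match. In this sketch on made-up data (the `s` values are hypothetical), `extract` keeps only the first match per row, while `extractall` returns one row per match with a two-level index of original row and match number:

```python
import pandas as pd

# Hypothetical summaries: two matches, one match, and no match
s = pd.Series(['taffy and more Taffy', 'Dog food', 'nothing here'])

# extract: first match per row; rows without a match get NaN
first = s.str.extract(r'(Dog|[Tt]affy)', expand=False)

# extractall: every match, one row per match,
# indexed by (original row, match number); rows without a match are dropped
every = s.str.extractall(r'(Dog|[Tt]affy)')

print(first)
print(every)
```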

### Testing for Strings that Match or Contain a Pattern

In [None]:
reviews_sample.text

In [None]:
pattern = r'[Gg]ood'

In [None]:
reviews_sample.text.str.contains(pattern)

In [None]:
reviews_sample.text.str.match(pattern)

In [None]:
pattern = r'.*[Gg]ood.*'

In [None]:
reviews_sample.text.str.match(pattern)
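The two results above differ because `str.contains` looks for the pattern anywhere in the string, while `str.match` only accepts a match that begins at the start of the string. A minimal sketch on made-up strings (the `s` values are hypothetical), which also includes `str.fullmatch` for comparison:

```python
import pandas as pd

# Hypothetical review snippets, used only for illustration
s = pd.Series(['Good stuff', 'Not so good', 'good'])
pattern = r'[Gg]ood'

anywhere = s.str.contains(pattern)    # pattern found anywhere in the string
at_start = s.str.match(pattern)       # pattern must match at the start
whole = s.str.fullmatch(pattern)      # pattern must match the entire string

print(pd.DataFrame({'contains': anywhere, 'match': at_start, 'fullmatch': whole}))
```

Wrapping the pattern in `.*` on both sides, as in the last cell above, makes `str.match` behave much like `str.contains` for single-line text.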

#### Helpful resources:
- Pandas text documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
- Regex Cheat Sheet: https://regexr.com/

### <font color="magenta">Q9: How many rows from the Amazon Food Reviews data set contain HTML tags in the ```text``` column?</font>

In [None]:
# Add your code here

### <font color="magenta">Q10: Remove all HTML tags from the Amazon Food Reviews text column and save the results to a column called ```text_no_html```.</font>

In [None]:
# Add your code here

### <font color="magenta">Q11: Replace the following words in the text column with the word 'POSITIVE_ADJ' (denoting positive adjectives) and save the results to a column called ```text_coded```.</font>
    
In all cases, you should find words that are either all lowercase, all uppercase, or words that start with an uppercase letter with the remaining letters lowercase:
    
good, great, excellent, best, perfect

In [None]:
# Add your code here

### <font color="magenta">Q12: How many rows contain multiple positive adjectives?</font>

In [None]:
# Add your code here