# Corpus Mining!

In this notebook, we'll learn:

* How to use pandas, the "Python Data Analysis Library"
* How to make basic visualizations

## From texts to corpora

So far we've worked primarily with individual texts. When we have worked with multiple texts, we've often simply hand-replicated the analysis for each text. The furthest we've gone toward automating an analysis across multiple texts is looping over a list of them.

Starting with this notebook, we're going to move from text analysis to corpus analysis. We'll still do the same kinds of text analysis we've been doing, but this time on many texts, and in such a way that our data analyses can be informed by the *metadata* we've collected on our corpora.

## Introduction to pandas

[Pandas](https://pandas.pydata.org/) is a very powerful software library for data analysis in Python. At the end of the day, though, think of it as **"Excel for robots"** (or Excel for Python). In other words, pandas lets Python think in terms of rows and columns. The pandas equivalent of an Excel sheet is the **DataFrame**.
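
To make the "rows and columns" idea concrete, here is a minimal sketch of a dataframe built by hand. The filenames, titles, and years are made up for illustration and are not part of any corpus used in this notebook.

```python
import pandas as pd

# Hypothetical toy metadata: each dictionary is one row, each key is one column
toy_rows = [
    {"fn": "book_a.txt", "title": "Book A", "year": 2001},
    {"fn": "book_b.txt", "title": "Book B", "year": 2003},
]
df_toy = pd.DataFrame(toy_rows)

print(df_toy.shape)          # (number of rows, number of columns)
print(list(df_toy.columns))  # column names, in order
```

A list of dictionaries like this is also the shape our own results will take later in the notebook.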

Pandas can convert actual Excel files into its own dataframes very easily.

In [None]:
# To use pandas, first import it (it's traditional to import it in this way)
import pandas as pd

In [None]:
# Read the Excel file of Harry Potter's metadata into a pandas dataframe (named df_potter)
df_potter = pd.read_excel('../corpora/harry_potter/metadata.xls')

# Show 
df_potter

In [None]:
# Convert the actual excel file of Tropic of Orange's metadata to a pandas data frame
df_tropic = pd.read_excel('../corpora/tropic_of_orange/metadata.xls')

# Show
df_tropic

In [None]:
# @TODO: Open your own pandas dataframe from your own corpus (follow examples of above)
#



## The Dataframe

Like an Excel sheet, a Pandas dataframe lets us access particular rows and columns of the data.
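
As a quick preview of the access patterns in this section, here is a small sketch on a made-up dataframe (`df_toy` and its values are hypothetical). It picks out a column, a row by position, and a single cell by row label.

```python
import pandas as pd

# Hypothetical toy metadata, made up for illustration
df_toy = pd.DataFrame({
    "fn": ["book_a.txt", "book_b.txt"],
    "title": ["Book A", "Book B"],
    "year": [2001, 2003],
})

# A column comes back as a pandas Series
titles = df_toy["title"]

# A row by its numeric position
first_row = df_toy.iloc[0]

# set_index returns a NEW dataframe; df_toy itself is left unchanged
df_by_fn = df_toy.set_index("fn")

# With a named index, .loc takes a row label and a column name
year_b = df_by_fn.loc["book_b.txt", "year"]
print(year_b)
```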

In [None]:
# Here's the entire data frame
df_potter

### Getting columns

In [None]:
# What are the columns again?
df_potter.columns

In [None]:
# Let's get the column for the title
df_potter['title']

In [None]:
# Another way to get a column
df_potter.title

In [None]:
# To make it a list:
list(df_potter['title'])

In [None]:
# To loop over it:
for title in df_potter.title:
    print(title)

In [None]:
# @TODO: Print the column for the narrator in the Tropic of Orange metadata
#



In [None]:
# @TODO: Loop over the title column (or equivalent) in your own metadata, and print out the title
#



### Getting rows

In [None]:
# We can get rows numerically by their index using iloc
df_potter.iloc[0]

In [None]:
# Let's get the last row
df_potter.iloc[-1]

In [None]:
# We can also SET a name for each row, and then use that as an index
# (set_index returns a new dataframe; df_potter itself is not changed)
df_potter.set_index('fn')

In [None]:
# Get the row for the Goblet of Fire
df_potter.set_index('fn').loc['Goblet of Fire.txt']

In [None]:
# @TODO: Get the row for the 30th chapter in Tropic of Orange
#

df_tropic.iloc[29]

In [None]:
# @TODO: Get the row for the 4th text in your corpus
#



In [None]:
# @TODO: Set the index on Tropic of Orange's dataframe to be the filename
# Then get the row for ch14.txt
#

df_tropic.set_index('fn').loc['ch14.txt']

### Filtering rows/columns

The real power of pandas is that we can do filters quite easily. For instance, what if we want all the rows in the Tropic of Orange spreadsheet which are narrated by a particular character?
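
The idea behind filtering is a "boolean mask": a column of True/False values, one per row, that tells pandas which rows to keep. Here is a minimal sketch on a made-up dataframe; the narrator names and chapter assignments are hypothetical and do not describe the actual novel.

```python
import pandas as pd

# Hypothetical chapter metadata, made up for illustration
df_chapters = pd.DataFrame({
    "fn": ["ch01.txt", "ch02.txt", "ch03.txt", "ch04.txt"],
    "narrator": ["Ana", "Ben", "Ana", "Cal"],
})

# Comparing a column to a value gives one True/False per row
mask = df_chapters["narrator"] == "Ana"
print(list(mask))

# Putting the mask inside the brackets keeps only the True rows
df_ana = df_chapters[mask]
print(list(df_ana["fn"]))
```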

In [None]:
# Let's look at the entire Tropic of Orange data again
df_tropic

In [None]:
# Let's look at the narrator column
df_tropic.narrator

In [None]:
# What happens when we ask if the narrator is Emi?

df_tropic.narrator == 'Emi'

In [None]:
# We can give this sequence of True/Falses a name
chapter_narrated_by_emi = df_tropic.narrator == 'Emi'

In [None]:
# If we put this sequence of True/Falses inside the bracket notation for the dataframe...
df_tropic[ chapter_narrated_by_emi ]

In [None]:
# ...Then we see only those rows narrated by Emi.

# The above is equivalent to:
df_tropic[ df_tropic.narrator == 'Emi' ]

In [None]:
# We can give this filtered version of the data frame a name
df_tropic_emi = df_tropic[ df_tropic.narrator == 'Emi' ]

In [None]:
# We can then get the filenames just for Emi's chapters
df_tropic_emi.fn

In [None]:
# This is useful because then we can loop over and do things just to Emi's files:
for filename in df_tropic_emi.fn:
    # do something here!
    print(filename,'...')

In [None]:
# @TODO: Filter the dataframe to show only Buzzworm's chapters
#

df_tropic[df_tropic.narrator == 'Buzzworm']

In [None]:
# @TODO: Filter the dataframe for 'part_title' as 'Artificial Intelligence'
#



In [None]:
# @TODO: Filter the dataframe for your own corpus in an interesting way
#



### Other ways to filter

In [None]:
# Get all rows NOT narrated by Emi
df_tropic[ df_tropic.narrator != 'Emi' ]

In [None]:
# Get all rows narrated by Emi or Gabriel
narrators_I_want = ['Emi','Gabriel Balboa']

df_tropic[ df_tropic.narrator.isin(narrators_I_want) ]  

In [None]:
# Get all rows NOT narrated by Emi or Gabriel
df_tropic[ ~df_tropic.narrator.isin(narrators_I_want) ]  
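
Filters can also be combined. The sketch below, on made-up data (the names and part numbers are hypothetical), uses `&` for "and" and `|` for "or". Each condition needs its own parentheses, because `&` and `|` bind more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical chapter metadata, made up for illustration
df_chapters = pd.DataFrame({
    "fn": ["ch01.txt", "ch02.txt", "ch03.txt", "ch04.txt"],
    "narrator": ["Ana", "Ben", "Ana", "Cal"],
    "part": [1, 1, 2, 2],
})

is_ana = df_chapters["narrator"] == "Ana"
is_part_1 = df_chapters["part"] == 1

# Rows narrated by Ana AND in part 2
df_ana_part_2 = df_chapters[is_ana & (df_chapters["part"] == 2)]

# Rows narrated by Ana OR in part 1
df_ana_or_part_1 = df_chapters[is_ana | is_part_1]

print(list(df_ana_part_2["fn"]))
print(list(df_ana_or_part_1["fn"]))
```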

### Other dataframe methods

In [None]:
# How many rows and columns are in the dataframe?

df_tropic.shape

In [None]:
# What if I want to see all unique values of a column?

df_tropic.narrator.unique()

In [None]:
# What if I want to see how many times these unique values occur?

df_tropic.narrator.value_counts()

In [None]:
# How often does each setting occur?
df_tropic.setting.value_counts()

In [None]:
# We can sort a dataframe by its values
df_potter.sort_values('year')

In [None]:
# Sort by two columns
df_tropic.sort_values(['narrator','part'])

In [None]:
# Order matters
df_tropic.sort_values(['part','narrator'])
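
To see why the order of the sort columns matters, here is a small sketch on made-up data (all values are hypothetical). The first column listed is the main sort key; the second only breaks ties within it.

```python
import pandas as pd

# Hypothetical chapter metadata, made up for illustration
df_chapters = pd.DataFrame({
    "fn": ["c1", "c2", "c3", "c4"],
    "narrator": ["Ben", "Ana", "Ben", "Ana"],
    "part": [2, 2, 1, 1],
})

# Group by narrator first, then order each narrator's chapters by part
by_narrator_then_part = df_chapters.sort_values(["narrator", "part"])

# Group by part first, then order each part's chapters by narrator
by_part_then_narrator = df_chapters.sort_values(["part", "narrator"])

print(list(by_narrator_then_part["fn"]))
print(list(by_part_then_narrator["fn"]))
```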

## Combining data + metadata

### Generating data the pandas way

In [None]:
def read(filename):
    with open(filename) as file:
        return file.read()

In [None]:
def is_punct(token):
    # Treat any token that does not start with a letter as punctuation
    # (note: this also filters out numbers such as "1997")
    return not token[0].isalpha()

In [None]:
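def tokenize(text,keep_punct=False):
    import nltk
    tokens=nltk.word_tokenize(text.lower())
    
    # keep every token if punctuation was requested
    if keep_punct:
        return tokens
    
    # otherwise keep only the tokens that are not punctuation
    tokens_nopunct=[]
    for token in tokens:
        if not is_punct(token):
            tokens_nopunct.append(token)
    
    return tokens_nopunct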

In [None]:
def count(tokens):
    from collections import Counter
    return Counter(tokens)

In [None]:
def tf(tokens):
    counts=count(tokens)
    num_words=len(tokens)
    for word in counts:
        counts[word] = counts[word] / num_words
    return counts
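
Term frequency is just a word's count divided by the total number of tokens, which lets us compare texts of different lengths. Here is a worked sketch on a made-up sentence. `term_frequencies` is a hypothetical variant of `tf` written for this example: it returns a plain dictionary and guards against an empty token list. A simple `split()` stands in for the nltk tokenizer.

```python
from collections import Counter

def term_frequencies(tokens):
    # Count each token, then divide by the total number of tokens
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {}  # avoid dividing by zero on an empty text
    return {word: n / total for word, n in counts.items()}

# A made-up sentence with 9 tokens; "the" appears twice
tokens = "the boy who lived and the boy who waited".split()
freqs = term_frequencies(tokens)

print(freqs["the"])         # 2 out of 9 tokens
print(sum(freqs.values()))  # all frequencies together add up to (about) 1
```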

#### Brief side-note on 'os'

In [None]:
# Operating System (os) functions
import os

In [None]:
# Get current working directory
os.getcwd()

In [None]:
# Print working directory (the terminal way)
!pwd

In [None]:
# The equivalent to ls:
os.listdir('.')

In [None]:
# Set a variable to the folder where the harry potter files are stored
textfolder_potter = '../corpora/harry_potter/texts/'

In [None]:
# The equivalent to ls [some folder]: we get a list of the files
os.listdir(textfolder_potter)

In [None]:
# We can loop over that list
for filename in os.listdir(textfolder_potter):
    print(filename)

In [None]:
# To get the full path, though, we need to join the FOLDER name to the FILE name

# the folder of texts
print(textfolder_potter)

# a filename in that folder
print(filename)

# full path = folder + file
os.path.join(textfolder_potter, filename)   # use os.path.join(folder, file)
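
Putting these pieces together, here is a small sketch that builds full paths for only the `.txt` files in a folder listing. The `filenames` list is hypothetical and stands in for whatever `os.listdir` would return, so the example runs without the corpus on disk.

```python
import os

# Same folder as above
folder = '../corpora/harry_potter/texts/'

# Hypothetical listing, standing in for os.listdir(folder)
filenames = ['Goblet of Fire.txt', 'notes.md']

# Keep only the .txt files and join each one to the folder
paths = [os.path.join(folder, fn) for fn in filenames if fn.endswith('.txt')]

for path in paths:
    # os.path.basename recovers the filename from the full path
    print(path, '->', os.path.basename(path))
```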

#### Back to Pandas

In [None]:
# Remind ourselves what this looks like
df_potter

In [None]:
import os
textfolder_potter = '../corpora/harry_potter/texts/'

# make empty results list
results=[]

# loop over the filenames
for fn in df_potter.fn:
    text_path = os.path.join(textfolder_potter, fn)
    
    # read file to string
    text_str = read(text_path)
    
    # tokenize string to list
    text_tokens = tokenize(text_str)
    
    # count list to dictionary
    text_counts = count(text_tokens)
    # term frequencies: each count divided by the total number of words
    text_tf = tf(text_tokens)
    
    # make a new result dictionary, always include the filename!
    text_results = {'fn':fn}
    
    # let's add our results
    text_results['count_Harry'] = text_counts.get('harry',0)
    text_results['count_Hermione'] = text_counts.get('hermione',0)
    text_results['count_Ron'] = text_counts.get('ron',0)
    
    text_results['tf_Harry'] = text_tf.get('harry',0)
    text_results['tf_Hermione'] = text_tf.get('hermione',0)
    text_results['tf_Ron'] = text_tf.get('ron',0)
    text_results['tf_Voldemort'] = text_tf.get('voldemort',0)
    
    # print the results
    print(text_results)
    
    # add our results to the master results list
    results.append(text_results)

In [None]:
# Now we can make a dataframe for just our results
df_results = pd.DataFrame(results)

# Show
df_results

In [None]:
# Let's set the index to filename also
df_results = df_results.set_index('fn')

In [None]:
# We can sort our results (lowest tf_Harry first)
df_results.sort_values('tf_Harry')

In [None]:
# Or sort in descending order (highest tf_Harry first)
df_results.sort_values('tf_Harry',ascending=False)

In [None]:
# Which novels mention Hermione most, relative to their length?
df_results.sort_values('tf_Hermione',ascending=False)

In [None]:
# And the same for Ron
df_results.sort_values('tf_Ron',ascending=False)

### Joining data to metadata

Here's a major payoff of using pandas to store our results. We can join them easily with our metadata!
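
Here is a minimal sketch of how such a join behaves, on made-up data (the filenames, years, and counts are hypothetical). The metadata keeps `fn` as an ordinary column, the results use `fn` as their index, and `on='fn'` tells pandas to match the two. By default every metadata row is kept, and a row with no matching result is filled with NaN.

```python
import pandas as pd

# Hypothetical metadata: 'fn' is an ordinary column
df_meta = pd.DataFrame({
    'fn': ['a.txt', 'b.txt', 'c.txt'],
    'year': [2001, 2003, 2005],
})

# Hypothetical results: 'fn' is the index, and c.txt is missing
df_data = pd.DataFrame([
    {'fn': 'a.txt', 'count_x': 10},
    {'fn': 'b.txt', 'count_x': 4},
]).set_index('fn')

# Match the 'fn' column of df_meta against the index of df_data
df_joined = df_meta.join(df_data, on='fn')
print(df_joined)
```

If a filename is misspelled in either table, it shows up here as a row of NaN values, which makes this a handy sanity check.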

In [None]:
# Here's our metadata again
df_potter

In [None]:
# Here's our data again
df_results

In [None]:
# Let's join them!
df_potter.join(df_results,on='fn')

In [None]:
# We can create a combined dataframe with both metadata and data
df_all = df_potter.join(df_results,on='fn')

# show
df_all

## Plotting

### Individual line plots

In [None]:
df_all.plot(x='series_num',y='tf_Harry')

In [None]:
df_all.plot(x='series_num',y='tf_Hermione')

In [None]:
df_all.plot(x='series_num',y='tf_Ron')

In [None]:
df_all.plot(x='series_num',y='tf_Voldemort')

### Multiple lines on one plot

In [None]:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()

df_all.plot(x='series_num',y='tf_Harry', ax=ax)
df_all.plot(x='series_num',y='tf_Hermione', ax=ax)
df_all.plot(x='series_num',y='tf_Ron', ax=ax)
df_all.plot(x='series_num',y='tf_Voldemort', ax=ax)
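
A shorter way to get the same kind of figure is to pass a list of column names as `y`. The sketch below uses a made-up dataframe (`df_toy` and its numbers are hypothetical, not real results) and also labels the axes, which the plots above leave at their defaults.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical results, made up for illustration
df_toy = pd.DataFrame({
    'series_num': [1, 2, 3],
    'tf_a': [0.010, 0.012, 0.009],
    'tf_b': [0.004, 0.006, 0.007],
})

fig, ax = plt.subplots()

# A list for y draws one line per column on the same axes
df_toy.plot(x='series_num', y=['tf_a', 'tf_b'], ax=ax)

# Label the axes so the plot can be read on its own
ax.set_xlabel('Book number in series')
ax.set_ylabel('Term frequency')
```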

## Classwork

In [None]:
# @TODO: Calculate the words per sentence and commas per sentence for each Harry Potter novel
# and visualize the results.
# 
# Please do not edit the results dataframe above but make a new results dataframe