# Data Analysis

### An NLP Project about Language Usage in Math Textbooks by Tommy Xu (Part 3)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical Differences

### Math Textbooks vs Random Google Searches

### Math Textbooks vs Other Textbooks

In [4]:
USHistory_openBC = pd.read_csv("src/scripts/textbook_data/USHistory_OpenBCEducation_LetterData.csv").set_index("Letter")
AmericanGov_openBC = pd.read_csv("src/scripts/textbook_data/AmericanGovernment2e_OpenBCEducation_LetterData.csv").set_index("Letter")
BusinessLaw_openBC = pd.read_csv("src/scripts/textbook_data/BusinessLawIEssentials_OpenBCEducation_LetterData.csv").set_index("Letter")
MacroEcon_openBC = pd.read_csv("src/scripts/textbook_data/PrinciplesofMacroeconomics2e_OpenBCEducation_LetterData.csv").set_index("Letter")
MicroEcon_openBC = pd.read_csv("src/scripts/textbook_data/PrinciplesofMicroeconomics2e_OpenBCEducation_LetterData.csv").set_index("Letter")
EvalOfSexViolTraining_openBC = pd.read_csv("src/scripts/textbook_data/EvalutatingSexualViolenceTraining_OpenBCText_LetterData.csv").set_index("Letter")

### Harder and Harder Math

In [3]:
## Textbooks from OpenBC, setting df index to Letter
elementary_openBC      = pd.read_csv("src/scripts/textbook_data/ElementaryAlgebra2e_OpenBCEducation_LetterData.csv").set_index("Letter")
intermediateAlg_openBC = pd.read_csv("src/scripts/textbook_data/IntermediateAlgebra2e_OpenBCEducation_LetterData.csv").set_index("Letter")
trig_openBC            = pd.read_csv("src/scripts/textbook_data/AlgebraAndTrigonometry_OpenBCEducation_LetterData.csv").set_index("Letter")
prealgebra_openBC      = pd.read_csv("src/scripts/textbook_data/PreAlgebra2e_OpenBCEducation_LetterData.csv").set_index("Letter")
precalculus_openBC     = pd.read_csv("src/scripts/textbook_data/PreCalculus_OpenBCEducation_LetterData.csv").set_index("Letter")
calculus1_openBC       = pd.read_csv("src/scripts/textbook_data/CalculusVolume1_OpenBCEducation_LetterData.csv").set_index("Letter")
calculus2_openBC       = pd.read_csv("src/scripts/textbook_data/CalculusVolume2_OpenBCEducation_LetterData.csv").set_index("Letter")
calculus3_openBC       = pd.read_csv("src/scripts/textbook_data/CalculusVolume3_OpenBCEducation_LetterData.csv").set_index("Letter")

In [5]:
# Place the per-book letter data side by side, one column per textbook
df_openBC_harderMath = pd.concat([prealgebra_openBC, elementary_openBC, intermediateAlg_openBC, trig_openBC,
                                  precalculus_openBC, calculus1_openBC, calculus2_openBC, calculus3_openBC], axis=1)
df_openBC_harderMath.columns = ["Pre-Algebra", "Elementary Algebra", "Intermediate Algebra", "Trigonometry",
                                "Pre-Calculus", "Calculus 1", "Calculus 2", "Calculus 3"]

# Normalize each book's column so its values sum to 1 (relative letter frequencies)
df_openBC_harderMath = df_openBC_harderMath / df_openBC_harderMath.sum()

# Transpose so that each row is a textbook and each column is a letter
df_openBC_harderMath = df_openBC_harderMath.transpose()

df_openBC_harderMath

| Letter | a | b | c | d | e | f | g | h | i | j | ... | q | r | s | t | u | v | w | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pre-Algebra | 0.073456 | 0.015732 | 0.038242 | 0.030131 | 0.12752 | 0.02092 | 0.015405 | 0.034115 | 0.079164 | 0.000488 | ... | 0.004112 | 0.065573 | 0.069313 | 0.095178 | 0.029337 | 0.012829 | 0.014906 | 0.011528 | 0.018168 | 0.0013 |
| Elementary Algebra | 0.075723 | 0.015458 | 0.034583 | 0.026819 | 0.121459 | 0.020667 | 0.014186 | 0.033685 | 0.078827 | 0.000671 | ... | 0.008416 | 0.062655 | 0.070085 | 0.094597 | 0.029903 | 0.012335 | 0.014181 | 0.016494 | 0.022543 | 0.001277 |
| Intermediate Algebra | 0.075416 | 0.014142 | 0.036232 | 0.025987 | 0.121875 | 0.022961 | 0.015436 | 0.033766 | 0.080706 | 0.00065 | ... | 0.008013 | 0.061361 | 0.068824 | 0.092724 | 0.029653 | 0.012084 | 0.013739 | 0.022805 | 0.020902 | 0.001414 |
| Trigonometry | 0.074145 | 0.013246 | 0.041468 | 0.025142 | 0.117046 | 0.026619 | 0.018894 | 0.03362 | 0.083336 | 0.000528 | ... | 0.005913 | 0.061208 | 0.066931 | 0.09573 | 0.029111 | 0.010499 | 0.012267 | 0.021628 | 0.017089 | 0.001379 |
| Pre-Calculus | 0.072701 | 0.01233 | 0.043779 | 0.024711 | 0.114801 | 0.028895 | 0.018707 | 0.035125 | 0.083526 | 0.000634 | ... | 0.00483 | 0.059016 | 0.066722 | 0.095846 | 0.030457 | 0.010813 | 0.012479 | 0.022316 | 0.016278 | 0.001377 |
| Calculus 1 | 0.074557 | 0.01252 | 0.038707 | 0.029213 | 0.112956 | 0.036444 | 0.018434 | 0.039174 | 0.082611 | 0.000734 | ... | 0.002046 | 0.051971 | 0.058449 | 0.094058 | 0.030577 | 0.015848 | 0.012304 | 0.039696 | 0.014663 | 0.000788 |
| Calculus 2 | 0.074159 | 0.013304 | 0.039904 | 0.029836 | 0.119145 | 0.026353 | 0.01839 | 0.034997 | 0.078018 | 0.000648 | ... | 0.003762 | 0.060449 | 0.064536 | 0.09415 | 0.030215 | 0.013713 | 0.011994 | 0.031093 | 0.015894 | 0.000625 |
| Calculus 3 | 0.071484 | 0.011896 | 0.043644 | 0.034165 | 0.113921 | 0.028696 | 0.016237 | 0.034869 | 0.076083 | 0.002878 | ... | 0.004004 | 0.061459 | 0.060048 | 0.094981 | 0.03083 | 0.018628 | 0.011522 | 0.025693 | 0.027837 | 0.007601 |


# Correlations and Calculating Similarity

This section looks at correlations in letter usage across all books. Notebook [2] hypothesized that letters **used with high frequency in math textbooks, like f, x, y, z,** may have a *positive* correlation with each other and a *negative* correlation with the **letters used with high frequency in non-math textbooks, like i, e, m, t, or s**.
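
One way to probe this hypothesis is to compute pairwise Pearson correlations between letter columns, treating each book as one observation. The sketch below is only an illustration: it uses four letter columns copied from the `df_openBC_harderMath` output above, the names `letter_freqs` and `letter_corr` are assumptions, and with eight math books and no non-math books it cannot confirm or refute the hypothesis on its own.

```python
import pandas as pd

# Relative frequencies for four letters, copied from the table above.
# Rows are books, columns are letters. `letter_freqs` is a hypothetical name.
letter_freqs = pd.DataFrame(
    {
        "e": [0.12752, 0.121459, 0.121875, 0.117046, 0.114801, 0.112956, 0.119145, 0.113921],
        "f": [0.02092, 0.020667, 0.022961, 0.026619, 0.028895, 0.036444, 0.026353, 0.028696],
        "s": [0.069313, 0.070085, 0.068824, 0.066931, 0.066722, 0.058449, 0.064536, 0.060048],
        "x": [0.011528, 0.016494, 0.022805, 0.021628, 0.022316, 0.039696, 0.031093, 0.025693],
    },
    index=["Pre-Algebra", "Elementary Algebra", "Intermediate Algebra", "Trigonometry",
           "Pre-Calculus", "Calculus 1", "Calculus 2", "Calculus 3"],
)

# DataFrame.corr() computes Pearson correlation between every pair of columns
letter_corr = letter_freqs.corr()
print(letter_corr.round(2))
```

Applied to the full frame, the same call would be `df_openBC_harderMath.corr()`, since that frame already has one row per book and one column per letter.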

Another goal of this section is to **develop a similarity formula for calculating how alike two letter usage patterns are**.
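
One candidate formula is cosine similarity, which compares the direction of two frequency vectors and ignores their overall scale. This is an assumption for illustration, not a formula the notebook has settled on. The helper name `cosine_similarity` is hypothetical, and the two vectors hold only the first five letters (a to e) of the Pre-Algebra and Calculus 1 rows from the table above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Frequencies of letters a, b, c, d, e copied from the table above
pre_algebra = [0.073456, 0.015732, 0.038242, 0.030131, 0.12752]
calculus_1 = [0.074557, 0.01252, 0.038707, 0.029213, 0.112956]

similarity = cosine_similarity(pre_algebra, calculus_1)
print(similarity)
```

Because every frequency is non-negative and a few common letters dominate, cosine similarity between any two English texts will tend to sit close to 1. Subtracting each letter's mean frequency across books before comparing (which makes the measure behave like a Pearson correlation) may separate books more clearly.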

# Classification

This last section aims to develop a machine learning classifier (with decent accuracy) that **uses letter usage patterns** to classify:
- Whether a textbook is about math or not about math
- Whether a text is a textbook or not a textbook
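
As a starting point for the first task, a nearest-centroid classifier is about the simplest model that works on frequency vectors: average the vectors of each class, then assign a new book to the closest average. The sketch below is not the method this project uses; the helpers `fit_centroids` and `predict` are hypothetical, the features are limited to the letters f and x, and the "not math" rows are placeholder values invented for illustration rather than measurements from any textbook.

```python
import numpy as np

def fit_centroids(X, y):
    """Return {label: mean feature vector} for each class label in y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    return {str(label): X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, X):
    """Assign each row of X to the label of its nearest centroid (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = list(centroids)
    distances = np.stack(
        [np.linalg.norm(X - centroids[label], axis=1) for label in labels], axis=1
    )
    return [labels[i] for i in distances.argmin(axis=1)]

# Features: relative frequencies of the letters (f, x).
# The "math" rows are copied from the table above
# (Pre-Algebra, Pre-Calculus, Calculus 1).
# The "not math" rows are PLACEHOLDERS made up for this sketch, not real data.
X_train = [
    [0.02092, 0.011528],
    [0.028895, 0.022316],
    [0.036444, 0.039696],
    [0.021, 0.002],  # placeholder
    [0.019, 0.003],  # placeholder
]
y_train = ["math", "math", "math", "not math", "not math"]

centroids = fit_centroids(X_train, y_train)

# Classify the Calculus 2 row from the table above, which was held out of training
print(predict(centroids, [[0.026353, 0.031093]]))
```

A real version would use all 26 letters, the non-math frames loaded earlier in this notebook, and a held-out set of books to estimate accuracy.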