# Citation

Much of the code and many of the examples are copied or adapted from:

> *Blueprints for Text Analytics Using Python* by Jens Albrecht, Sidharth Ramachandran, and Christian Winkler (O'Reilly, 2021), 978-1-492-07408-3.

- https://github.com/blueprints-for-text-analytics-python/blueprints-text
- https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch01/First_Insights.ipynb

---

# Setup

In [1]:
%matplotlib inline

import os
from pathlib import Path
import helpsk as hlp
import numpy as np
import pandas as pd

from helpers.utilities import Timer, get_logger

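def get_project_directory():
    # The notebook runs from <project>/source/executables; strip that suffix
    # from the working directory to get the project root.
    return os.getcwd().replace('/source/executables', '')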

print(get_project_directory())

/Users/shanekercheval/repos/nlp-template
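
The helper above depends on a string replacement with a hard-coded `/` separator. A minimal sketch of a `pathlib`-based alternative is shown below, assuming the same `source/executables` layout. The name `find_project_directory` and its `suffix` parameter are hypothetical and not part of the repository. Unlike the original, it removes the suffix only when it appears at the end of the path.

```python
from pathlib import PurePath, PurePosixPath


def find_project_directory(cwd: PurePath, suffix: tuple = ('source', 'executables')) -> PurePath:
    """Return `cwd` without a trailing `source/executables`, if present.

    Hypothetical helper: the name and the `suffix` parameter are assumptions
    made for this illustration.
    """
    parts = cwd.parts
    if parts[-len(suffix):] == suffix:
        # Rebuild the path from every component except the trailing suffix.
        return type(cwd)(*parts[:-len(suffix)])
    return cwd


# PurePosixPath keeps the example independent of the operating system.
example = PurePosixPath('/Users/shanekercheval/repos/nlp-template/source/executables')
print(find_project_directory(example))
```

In the notebook itself, the call would presumably be `find_project_directory(Path.cwd())`.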


---

# Exploratory Data Analysis

This section loads the processed UN General Debates dataset and summarizes its numeric and non-numeric columns before looking at individual speakers.

In [2]:
with Timer("Loading Data"):
    path = os.path.join(get_project_directory(), 'artifacts/data/processed/un-general-debates-blueprint.pkl')
    un_debates = pd.read_pickle(path)

Started: Loading Data
Finished (1.00 seconds)
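
`Timer` comes from the project's own `helpers.utilities` module, whose source is not shown here. The sketch below shows one way a context manager could produce the same "Started / Finished" messages. `SimpleTimer` is a hypothetical stand-in, not the project's implementation.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in for `helpers.utilities.Timer` (an assumption)."""

    def __init__(self, message: str):
        self.message = message
        self.elapsed = None

    def __enter__(self):
        print(f'Started: {self.message}')
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print(f'Finished ({self.elapsed:.2f} seconds)')
        return False  # do not suppress exceptions raised inside the block


with SimpleTimer('Summing numbers') as timer:
    total = sum(range(1_000_000))
```

Returning `False` from `__exit__` lets any error raised while loading data propagate instead of being hidden by the timer.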


---

In [3]:
hlp.pandas.numeric_summary(un_debates)

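| Column | # of Non-Nulls | # of Nulls | % Nulls | # of Zeros | % Zeros | Mean | St Dev. | Coef of Var | Skewness | Kurtosis | Min | 10% | 25% | 50% | 75% | 90% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| session | 7507 | 0 | 0.0% | 0 | 0.0% | 49.6 | 12.9 | 0.3 | -0.2 | -1.1 | 25 | 31.0 | 39.0 | 51.0 | 61.0 | 67.0 | 70 |
| year | 7507 | 0 | 0.0% | 0 | 0.0% | 1994.6 | 12.9 | 0.0 | -0.2 | -1.1 | 1970 | 1976.0 | 1984.0 | 1996.0 | 2006.0 | 2012.0 | 2015 |
| num_tokens | 7507 | 0 | 0.0% | 0 | 0.0% | 1480.3 | 635.2 | 0.4 | 1.1 | 1.7 | 187 | 793.6 | 1005.5 | 1358.0 | 1848.0 | 2336.4 | 5688 |
| text_length | 7507 | 0 | 0.0% | 0 | 0.0% | 17967.3 | 7860.0 | 0.4 | 1.1 | 1.8 | 2362 | 9553.8 | 12077.0 | 16424.0 | 22479.5 | 28658.2 | 72041 |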
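
`hlp.pandas.numeric_summary` is a `helpsk` helper. For readers without that package, several of the same columns can be approximated with plain pandas. This is a rough sketch on made-up toy data, not a reimplementation of the helper.

```python
import pandas as pd

# Toy data for illustration only; not taken from the UN debates dataset.
toy = pd.DataFrame({
    'year': [1970, 1971, 1972, 1973],
    'num_tokens': [100, 200, None, 400],
})

# describe() gives count, mean, std, min, the requested percentiles, and max.
summary = toy.describe(percentiles=[0.10, 0.25, 0.50, 0.75, 0.90]).T

# Add a null count per column, aligned on the column names in the index.
summary.insert(1, '# of Nulls', toy.isna().sum())
print(summary)
```

Skewness, kurtosis, and the zero counts are omitted here; pandas exposes `DataFrame.skew` and `DataFrame.kurt` if they are needed.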


In [4]:
hlp.pandas.non_numeric_summary(un_debates)

| Column | # of Non-Nulls | # of Nulls | % Nulls | Most Freq. Value | # of Unique | % Unique |
|---|---|---|---|---|---|---|
| country | 7507 | 0 | 0.0% | ALB | 199 | 2.7% |
| country_name | 7507 | 0 | 0.0% | Albania | 199 | 2.7% |
| speaker | 7507 | 0 | 0.0% | `<unknown>` | 5429 | 72.3% |
| position | 7507 | 0 | 0.0% | `<unknown>` | 114 | 1.5% |
| text | 7507 | 0 | 0.0% | 33: May I first convey to our [...] | 7507 | 100.0% |
| tokens | 7507 | 0 | 0.0% | `['may', 'first', 'convey', 'pr[...]` | 7507 | 100.0% |
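
The non-numeric summary can be approximated in the same spirit. The sketch below builds a comparable table with plain pandas on made-up toy data; the column labels mirror the output above, but the code is an assumption about how such a table could be built, not the `helpsk` implementation.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'country': ['ALB', 'ALB', 'USA', 'FRA'],
    'speaker': ['<unknown>', 'A', '<unknown>', 'B'],
})

summary = pd.DataFrame({
    '# of Non-Nulls': toy.count(),
    '# of Nulls': toy.isna().sum(),
    # mode() can return several rows on ties; the first row is used here.
    'Most Freq. Value': toy.mode().iloc[0],
    '# of Unique': toy.nunique(),
})
summary['% Unique'] = summary['# of Unique'] / len(toy)
print(summary)
```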


---

In [None]:
un_debates[un_debates['speaker'].str.contains('Bush')]['speaker'].value_counts()
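
The filter above keeps every row whose `speaker` contains the substring "Bush" and then counts each distinct spelling. The sketch below shows the same pattern on a small made-up Series, so the names are illustrative and not results from the dataset. It adds `na=False` and `regex=False`, which are optional here because the summary above reports no nulls in `speaker`, but they make the mask robust to missing values and to special characters in the search term.

```python
import pandas as pd

# Made-up speaker names for illustration only.
speakers = pd.Series(
    ['George W. Bush', 'Mr. George W. Bush', 'George Bush',
     '<unknown>', None, 'George W. Bush'],
    name='speaker',
)

# na=False treats missing names as non-matches instead of propagating NaN.
mask = speakers.str.contains('Bush', na=False, regex=False)
counts = speakers[mask].value_counts()
print(counts)
```

Because the match is a plain substring test, different spellings of the same person are counted separately, which is a quick way to spot inconsistent naming.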