## Some more on ```spaCy``` and ```pandas```

First we want to import some of the packages we need.

In [2]:
import os
import spacy

# Remember we need to initialise spaCy
nlp = spacy.load("en_core_web_sm")

We can inspect this object and see that it's what we've been calling a ```spaCy``` object.

In [3]:
type(nlp)

spacy.lang.en.English

We use this ```spaCy``` object to create annotated outputs, what we call a ```Doc``` object.

In [4]:
example = 'this is an english sentence'
doc = nlp(example)

In [5]:
type(doc)

spacy.tokens.doc.Doc

```Doc``` objects are sequences of tokens, meaning we can iterate over the tokens and output the specific annotations we want, such as the POS tag or lemma.

In [6]:
for token in doc:
    print(token.text, token.pos_, token.lemma_)

this PRON this
is AUX be
an DET an
english ADJ english
sentence NOUN sentence


__Reading data with ```pandas```__

```pandas``` is the main library in Python for working with DataFrames. These are tabular objects of mixed data types, comprising rows and columns.

In ```pandas``` vocabulary, a column is called a ```Series```, which is like a sophisticated list. I'll be using the names ```Series``` and column pretty much interchangeably.

In [7]:
import pandas as pd

In [11]:
filepath = os.path.join('..','..','..','CDS-LANG', 'tabular_examples', 'fake_or_real_news.csv')
print(filepath)

../../../CDS-LANG/tabular_examples/fake_or_real_news.csv


In [12]:
df = pd.read_csv(filepath)

We can use ```.sample()``` to take random samples of the dataframe.

In [14]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,title,text,label
4650,84,U.S. District Judge orders homophobic Kentucky...,"Kim Davis — the Rowan County, Kentucky Clerk w...",REAL
1706,1831,Fiorina rejects idea of 'affirmative action' i...,"On this day in 1973, J. Fred Buzhardt, a lawye...",REAL
1899,7962,Comments of the Week: Here Comes Trouble,Comments of the Week: Here Comes Trouble Poste...,FAKE
2364,7452,Clinton campaign blames FBI director for loss ...,Politics FBI Director James Comey (AFP file ph...,FAKE
4200,721,Obama says world leaders right to be 'rattled'...,President Obama said world leaders were right ...,REAL
1537,6974,"Tesla, ‘World’s Safest Car,’ Explodes Like a Bomb","Tesla, ‘World’s Safest Car,’ Explodes Like a B...",FAKE
3703,7603,Speaker at Sanders Rally Tells Crowd Not to Vo...,We Are Change \nThe president of the Iowa Stat...,FAKE
1408,345,Cory Booker on how America's criminal justice ...,German Lopez:You've called the war on drugs a ...,REAL
4149,3220,Republicans Are Now Seen As The More Extreme P...,"In a shift of opinion since the 2014 midterms,...",REAL
3520,3733,"Charleston church massacre suspect caught, but...",Police on Thursday nabbed the 21-year-old man ...,REAL


To delete unwanted columns, we can do the following:

In [19]:
del df['Unnamed: 0'] # works to delete columns, but this is a dangerous way of doing it
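
One commonly used alternative is ```.drop()```, which returns a new DataFrame instead of modifying the existing one in place. The sketch below uses a small made-up DataFrame (```toy_df``` and its values are assumptions for illustration, not the real news data):

```python
import pandas as pd

# Toy DataFrame standing in for the news data (all values are made up)
toy_df = pd.DataFrame({
    "Unnamed: 0": [8476, 10294, 3608],
    "title": ["a", "b", "c"],
    "label": ["FAKE", "FAKE", "REAL"],
})

# drop() returns a new DataFrame and leaves toy_df untouched
trimmed_df = toy_df.drop(columns=["Unnamed: 0"])

print(list(toy_df.columns))
print(list(trimmed_df.columns))
```

Because the original is left as it was, a mistyped column name or a change of mind does not cost us the data.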

In [21]:
df['label'].value_counts()

REAL    3171
FAKE    3164
Name: label, dtype: int64

We can count the distribution of possible values in our data using ```.value_counts()``` - e.g. how many REAL and FAKE news entries do we have in our DataFrame?

__Filter on columns__

To filter on columns, we define a condition on which we want to filter and use that to filter our DataFrame. We use the square-bracket syntax, just as if we were slicing a list or string.

In [29]:
fake_news_df = df[df['label'] == 'FAKE']

In [31]:
real_news_df = df[df['label'] == 'REAL']

Here we create two new dataframes, one with only fake news text, and one with only real news text.
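
To see what the condition is doing, it can help to look at it on its own. The sketch below uses a small made-up DataFrame (```toy_df``` and its contents are assumptions for illustration) and splits the filter into two steps:

```python
import pandas as pd

# Toy DataFrame standing in for the news data (all values are made up)
toy_df = pd.DataFrame({
    "title": ["first", "second", "third", "fourth"],
    "label": ["FAKE", "REAL", "FAKE", "REAL"],
})

# The condition on its own is a Series of True/False values, one per row
is_fake = toy_df["label"] == "FAKE"

# Passing that mask inside square brackets keeps only the rows marked True
toy_fake_df = toy_df[is_fake]
toy_real_df = toy_df[toy_df["label"] == "REAL"]

print(is_fake.tolist())
print(toy_fake_df)
```

Note that the filtered DataFrames keep the original row labels, which is why the index of a filtered DataFrame can have gaps in it.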

In [32]:
fake_news_df

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
...,...,...,...,...
6326,6143,DOJ COMPLAINT: Comey Under Fire Over Partisan ...,DOJ COMPLAINT: Comey Under Fire Over Partisan ...,FAKE
6328,9337,Radio Derb Is On The Air–Leonardo And Brazil’s...,,FAKE
6329,8737,Assange claims ‘crazed’ Clinton campaign tried...,Julian Assange has claimed the Hillary Clinton...,FAKE
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE


In [36]:
output_data = []

for token in doc:
    output_data.append((token.text, token.pos_, token.lemma_))

In [37]:
output_data

[('this', 'PRON', 'this'),
 ('is', 'AUX', 'be'),
 ('an', 'DET', 'an'),
 ('english', 'ADJ', 'english'),
 ('sentence', 'NOUN', 'sentence')]

In [40]:
pd.DataFrame(output_data, columns = ['text', 'pos', 'lemma'])

,text,pos,lemma
0,this,PRON,this
1,is,AUX,be
2,an,DET,an
3,english,ADJ,english
4,sentence,NOUN,sentence


__Counters__

In the following cell, you can see how to use a 'counter' to count how many entries are in a list.

The += operator adds 1 to the variable ```counter``` for every entry in the list.
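
A minimal sketch of such a counter, assuming a made-up list called ```entries```:

```python
# Made-up list to count over
entries = ["this", "is", "an", "english", "sentence"]

# Start at zero and add 1 for every entry we see
counter = 0
for entry in entries:
    counter += 1

print(counter)
```

For a plain list this gives the same number as ```len(entries)```; the pattern becomes useful once we only want to count entries that meet some condition.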

__Counting features in data__

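Using the same logic, we can count how often adjectives appear in our data. Note that ```JJ``` is the fine-grained tag for adjectives (```token.tag_```); the coarse-grained equivalent (```token.pos_```) is ```ADJ```.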

This is useful from a linguistic perspective; we could now, for example, figure out how many of each part of speech can be found in our data.

In this case, we're using ```nlp.pipe``` from ```spaCy``` to group the entries together into batches of 500 at a time.

Why?

Every time we execute ```nlp(text)```, it incurs a small computational overhead, which means that scaling becomes an issue. An overhead of 0.01s per document adds up quickly when dealing with 1,000,000 or 10,000,000 or 100,000,000 documents.

If we batch, we can therefore be a bit more efficient. It also allows us to keep our ```spaCy``` logic compact and together, which becomes useful for more complex tasks.
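
The counting logic described above could be sketched roughly as follows. The helper name ```count_tag``` and its parameters are assumptions made for illustration, and the check is assumed to be on the fine-grained tag ```token.tag_```; the only ```spaCy``` calls used are ```nlp.pipe``` with ```batch_size``` and the token attribute ```tag_```.

```python
def count_tag(nlp, texts, tag="JJ", batch_size=500):
    """Count tokens whose fine-grained tag equals `tag` across all texts.

    `count_tag` is a hypothetical helper; `nlp` is a loaded spaCy pipeline
    and `texts` is an iterable of strings.
    """
    counter = 0
    # nlp.pipe processes the texts in batches rather than one call per text
    for doc in nlp.pipe(texts, batch_size=batch_size):
        for token in doc:
            if token.tag_ == tag:
                counter += 1
    return counter

# Intended usage (assumes spaCy and the en_core_web_sm model are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   n_adjectives = count_tag(nlp, ["this is an english sentence"])
```

Keeping the loop inside one function means the batching and the counting live in the same place, which is the compactness mentioned above.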

## Sentiment with ```spaCy```

To work with spaCyTextBlob, we need to make sure that we are working with ```spacy==2.3.5```. 

Follow the separate instructions posted to Slack to make this work.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
# initialise spacy
nlp = spacy.load("en_core_web_sm")

Here, we initialise spaCyTextBlob and add it as a new component to our ```spaCy``` nlp pipeline.

Let's test spaCyTextBlob on a single text, specifically Virginia Woolf's _To the Lighthouse_, published in 1927.

We use ```spaCy``` to create a ```Doc``` object for the entire text (how might you do this in batch?)

We can extract the polarity for each sentence in the novel and create a list of scores, one per sentence.

We can create a quick and cheap plot using matplotlib - this is only fine in Jupyter Notebooks, don't do this in the wild!

We can then use some fancy methods from ```pandas``` to calculate a rolling mean over a certain window length.

For example, we group together our polarity scores into a window of 100 sentences at a time and calculate an average on that window.
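
As a rough sketch of the idea, the example below applies ```.rolling()``` and ```.mean()``` to a short made-up list of scores. The values in ```polarity``` are invented, and the window is 3 rather than 100 only to keep the example small.

```python
import pandas as pd

# Made-up polarity scores standing in for the per-sentence scores
polarity = [0.0, 0.5, -0.5, 1.0, 0.0, -1.0]

# Each value is the mean of the current score and the two before it;
# the first two positions have no full window yet and come out as NaN
smoothed = pd.Series(polarity).rolling(3).mean()

print(smoothed)
```

The larger the window, the smoother the resulting curve, at the cost of losing more detail and more leading values.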

Plotting the rolling average gives us a 'smoothed' view of polarity over the course of the novel, helping to cut through the noise.