If you have not yet downloaded the following language model for `SpaCy`, you may want to do so now. Please note that this is not the most accurate model: it is geared towards efficiency rather than accuracy. A larger model can be used instead (for English: `en_core_web_trf`). For instructions on installing `SpaCy`, see [here](https://spacy.io/usage).

In [48]:
!python3 -m spacy download en_core_web_sm 

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
     |████████████████████████████████| 13.7 MB 3.1 MB/s eta 0:00:01
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
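
Because the download installs the model as an ordinary Python package (see the `pip` output above), you can check for it before downloading again. The following is a minimal sketch; the helper name `model_is_installed` is hypothetical and not part of `SpaCy`.

```python
import importlib.util

def model_is_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None

# Only suggest the download when the model package is missing
if not model_is_installed("en_core_web_sm"):
    print("Model missing; run: python3 -m spacy download en_core_web_sm")
```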


### Load packages

In [49]:
from glob import glob
import random
import nltk
from nltk.stem.snowball import SnowballStemmer
import spacy
from nltk.corpus import stopwords
mystopwords = stopwords.words("english")


### Read the data

In [50]:
infowarsfiles = glob('articles/*/Infowars/*')
infowarsarticles = []
for filename in infowarsfiles:
    with open(filename) as f:
        infowarsarticles.append(f.read())
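
If some files are not in your platform's default encoding, reading them as above may raise a `UnicodeDecodeError`. The sketch below is one possible alternative, assuming the files are UTF-8; the helper name `read_articles`, its `root` parameter, and the choice of `errors="replace"` are assumptions for illustration.

```python
from pathlib import Path

def read_articles(pattern, root="."):
    """Read every file matching a glob pattern (relative to root) into a list of strings."""
    texts = []
    for path in sorted(Path(root).glob(pattern)):   # sorted for a stable order
        if path.is_file():
            # Assumption: UTF-8 files; undecodable bytes are replaced, not fatal
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
    return texts

# Same pattern as in the cell above; returns an empty list if nothing matches
infowars_texts = read_articles("articles/*/Infowars/*")
print(len(infowars_texts))
```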

You could take a random sample if you want to play around with a smaller dataset:

```python
articles = random.sample(infowarsarticles, 10)
```
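
A plain `random.sample` call returns a different subset on every run. If you want the same subset each time, one option is a seeded generator. This is a minimal sketch; the helper `sample_docs`, the seed value, and the toy `docs` list are assumptions for illustration.

```python
import random

def sample_docs(docs, k, seed=42):
    """Draw up to k documents without replacement, reproducibly."""
    rng = random.Random(seed)                 # local generator, global state untouched
    return rng.sample(docs, min(k, len(docs)))  # never ask for more than we have

docs = [f"doc {i}" for i in range(100)]       # stand-in for infowarsarticles
subset = sample_docs(docs, 10)
print(len(subset))
```

Capping `k` at `len(docs)` avoids the `ValueError` that `sample` raises when the requested sample is larger than the population.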

In [51]:
print(f"I have imported {len(infowarsarticles)} docs\n\n\n")  # check whether we have some data
print(f"the first entry looks like this:\n\n\n{infowarsarticles[0]}")

I have imported 2000 docs



the first entry looks like this:


Former Virginia Gov. Terry McAuliffe announced within minutes of Justice Kavanaughs appointment to the Supreme Court McAuliffe that he was predicting doom for millions.

The nomination of Judge Brett Kavanaugh will threaten the lives of millions of Americans for decades to come and will morph our Supreme Court into a political arm of the right-wing Republican Party, McAuliffe said.

Of course the policies of the left literally do kill millions. Abortions kill roughly 700,000 Americans per year. Open borders result in the DUI deaths of roughly 5000 Americans per year by drunk-driving illegal aliens, not to mention the flow of heroin over the southern border reaching epidemic proportions in the U.S., killing roughly 16,000 per year.

Add that together and just the policies of Abortion and Open borders kill 721,000 American citizens per year. That is roughly 7,210,000 American Citizens killed per decade by the policies

### Start with pre-processing in `SpaCy`

It makes sense to start your pre-processing in `SpaCy` (at least, if you want to use that module), as `SpaCy` expects raw text that has not yet been cleaned or tokenized in any way.

In the following code block, we'll apply lemmatization and tokenization in one go.

In [52]:
nlp = spacy.load("en_core_web_sm")
lemmatized_articles = [[token.lemma_ for token in nlp(art)] for art in infowarsarticles]

Check whether that worked out...

In [53]:
print(f"lemmatized: {lemmatized_articles[0][:10]}\n\n")
print(f"original: {infowarsarticles[0].split()[:10]}")

lemmatized: ['former', 'Virginia', 'Gov.', 'Terry', 'McAuliffe', 'announce', 'within', 'minute', 'of', 'Justice']


original: ['Former', 'Virginia', 'Gov.', 'Terry', 'McAuliffe', 'announced', 'within', 'minutes', 'of', 'Justice']
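
Calling `nlp(art)` once per article can be slow on a few thousand documents. `SpaCy` pipelines also provide `nlp.pipe`, which processes texts in batches and yields the resulting documents in input order. The sketch below wraps that call; the helper name `lemmatize_corpus` and the `batch_size` of 50 are assumptions, not tuned values.

```python
def lemmatize_corpus(nlp, texts, batch_size=50):
    """Lemmatize texts in batches; `nlp` is a loaded pipeline with a .pipe method."""
    return [
        [token.lemma_ for token in doc]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]

# Intended use in this notebook (requires the model loaded above):
# lemmatized_articles = lemmatize_corpus(nlp, infowarsarticles)
```

The result should have the same nested-list shape as `lemmatized_articles` above, so the later cells can be used unchanged.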


#### If you want to do more with SpaCy, do that before moving on... 

For example, you could extract the 'person' entities if you like...

In [54]:
def get_ents(doc):
    """Return (label, span) pairs for all PERSON entities in a SpaCy doc."""
    return [(ent.label_, ent) for ent in doc.ents if ent.label_ == 'PERSON']

entities = [get_ents(nlp(art)) for art in infowarsarticles]

In [55]:
entities[0]

[('PERSON', Terry McAuliffe),
 ('PERSON', Kavanaughs),
 ('PERSON', Brett Kavanaugh),
 ('PERSON', Citizens)]
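
Note that the small model is not perfect: `Citizens` is tagged as a person here. To get an overview across the whole corpus, you could count how often each name occurs. This is a minimal sketch with a hypothetical helper `count_people`; it uses toy data with plain strings in place of `SpaCy` spans.

```python
from collections import Counter

def count_people(entities_per_doc, top_n=10):
    """Count how often each PERSON string occurs across all documents."""
    counts = Counter(
        getattr(ent, "text", ent)      # spans expose .text; plain strings pass through
        for doc_ents in entities_per_doc
        for label, ent in doc_ents
        if label == "PERSON"
    )
    return counts.most_common(top_n)

# Toy input shaped like the `entities` list above
toy = [
    [("PERSON", "Terry McAuliffe"), ("PERSON", "Brett Kavanaugh")],
    [("PERSON", "Terry McAuliffe")],
]
print(count_people(toy))   # [('Terry McAuliffe', 2), ('Brett Kavanaugh', 1)]
```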

### Lower case and remove stopwords

Next, we'll write a function to lowercase the data and remove stopwords.

In [56]:

def lower_and_remove_stopwords(x):
    # Compare the lowercased token, so capitalized stopwords (e.g. 'The') are removed too
    return [i.lower() for i in x if i.lower() not in mystopwords]

clean = [lower_and_remove_stopwords(doc) for doc in lemmatized_articles]

In [57]:
print(f"without stopwords and lowercased: {clean[0][:10]}\n\n")
print(f"original: {infowarsarticles[0].split()[:10]}")

without stopwords and lowercased: ['former', 'virginia', 'gov.', 'terry', 'mcauliffe', 'announce', 'within', 'minute', 'justice', 'kavanaughs']


original: ['Former', 'Virginia', 'Gov.', 'Terry', 'McAuliffe', 'announced', 'within', 'minutes', 'of', 'Justice']
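
The tokenizer also emits punctuation marks as separate tokens, which you may not want in your features. The sketch below is one possible variant of the cleaning step that drops them as well and uses a set for faster stopword lookups. The helper `clean_tokens` and the tiny `toy_stopwords` list are assumptions for illustration.

```python
import string

def clean_tokens(tokens, stopwords):
    """Lowercase tokens, then drop stopwords and pure-punctuation tokens."""
    stop = set(stopwords)                      # set membership is fast on average
    cleaned = []
    for tok in tokens:
        low = tok.lower().strip()
        if not low or low in stop:
            continue                           # skip empty strings and stopwords
        if all(ch in string.punctuation for ch in low):
            continue                           # skip tokens such as ',' or '.'
        cleaned.append(low)
    return cleaned

toy_stopwords = ["the", "of", "to"]            # hypothetical stand-in for mystopwords
print(clean_tokens(["The", "nomination", "of", "Judge", ",", "Brett", "."], toy_stopwords))
```

Tokens that merely contain punctuation, such as `gov.`, are kept, because only tokens consisting entirely of punctuation are removed.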


### Create bigrams and/or trigrams if you like

In [58]:
articles_bigrams = [["_".join(tup) for tup in nltk.ngrams(art,2)] for art in clean]
articles_trigrams = [["_".join(tup) for tup in nltk.ngrams(art,3)] for art in clean]

In [59]:
print(f"bigrams: {articles_bigrams[0][:10]}\n\n")
print(f"trigrams: {articles_trigrams[0][:10]}\n\n")
print(f"original: {infowarsarticles[0].split()[:10]}")

bigrams: ['former_virginia', 'virginia_gov.', 'gov._terry', 'terry_mcauliffe', 'mcauliffe_announce', 'announce_within', 'within_minute', 'minute_justice', 'justice_kavanaughs', 'kavanaughs_appointment']


trigrams: ['former_virginia_gov.', 'virginia_gov._terry', 'gov._terry_mcauliffe', 'terry_mcauliffe_announce', 'mcauliffe_announce_within', 'announce_within_minute', 'within_minute_justice', 'minute_justice_kavanaughs', 'justice_kavanaughs_appointment', 'kavanaughs_appointment_supreme']


original: ['Former', 'Virginia', 'Gov.', 'Terry', 'McAuliffe', 'announced', 'within', 'minutes', 'of', 'Justice']
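
To see what an n-gram function does under the hood, here is a minimal pure-Python sketch that slides a window of size `n` over a token list. The helper name `joined_ngrams` and the toy token list are assumptions for illustration; it is meant to mirror the list comprehensions above, not to replace `nltk.ngrams`.

```python
def joined_ngrams(tokens, n):
    """Return all n-grams of a token list, joined with underscores."""
    # zip over n shifted copies of the list; zip stops at the shortest one
    shifted = (tokens[i:] for i in range(n))
    return ["_".join(gram) for gram in zip(*shifted)]

toy = ["former", "virginia", "gov.", "terry", "mcauliffe"]
print(joined_ngrams(toy, 2))
# ['former_virginia', 'virginia_gov.', 'gov._terry', 'terry_mcauliffe']
print(joined_ngrams(toy, 3))
# ['former_virginia_gov.', 'virginia_gov._terry', 'gov._terry_mcauliffe']
```

A list of `k` tokens yields `k - n + 1` n-grams, so a document shorter than `n` tokens yields none.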
