# Data cleaning

- Get dataframe of texts from Academy URLs
- Create test corpus to build cleaning function
- Perform first round of data cleaning
    - unwanted symbols
    - make lowercase
    - remove numbers
- Second round of cleaning
    - lemmatisation
    - tokenization
    - remove stop words
- Third round of cleaning
    - stemming
    - create bigrams
- Bag of words (BoW)

### Once the cleaning steps are combined in a function, we'll run it on the full df of separated content (i.e. content split by HTML elements) and the full article df (i.e. non-split content)

## 1. Import Academy texts dataframe

In [None]:
import pandas as pd
import pickle


In [None]:
# Importing full df of content separated by HTML element

with open('../04_Data/academy_posts.pkl', 'rb') as file:
    df = pickle.load(file)


## 2. Create a subset as the corpus for testing

For now we only need the url and content columns.

In [None]:
# .copy() makes the subset independent of df, which avoids a SettingWithCopyWarning later
df_subset = df[['url','content']].head(10).copy()


In [None]:
df_subset


## 3. Data cleaning round 1
Convert to lower case and remove punctuation and numbers.

In [None]:
import re
import string


In [None]:
test = df_subset.loc[6].content
test


In [None]:
def cleaning_round1(text):
    r'''lowercase, remove punctuation, remove \xa0, remove numbers + words with numbers'''
    
    text = text.lower()
    text = re.sub('-', ' ', text)  # split hyphenated words before punctuation is stripped
    
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\xa0', ' ', text)  # non-breaking space
    text = re.sub(r'\w*\d\w*', '', text)  # numbers and words containing digits
    text = re.sub('[”“–‘’]', '', text)  # curly quotes and en dashes
    
    return text


In [None]:
test_round1 = cleaning_round1(test)
test_round1
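
Because round 1 deletes tokens rather than replacing them, it leaves runs of spaces behind. That is harmless for the tokenisation in round 2, but it makes the intermediate output harder to read. A minimal sketch, assuming a made-up example sentence and a hypothetical helper `collapse_whitespace` that is not part of the notebook:

```python
import re
import string

def round1(text):
    # Same steps as cleaning_round1 above, repeated so the sketch runs on its own
    text = text.lower()
    text = re.sub('-', ' ', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\xa0', ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub('[”“–‘’]', '', text)
    return text

def collapse_whitespace(text):
    # Hypothetical helper: squeeze whitespace runs to single spaces and trim the ends
    return ' '.join(text.split())

sample = 'Re-joining the Paris Agreement in 2021: “Day 1” at COP26!'  # made-up text
cleaned = round1(sample)
tidy = collapse_whitespace(cleaned)
```

Whether to add such a step to the cleaning function is a matter of taste, since the later tokenisation discards the extra whitespace anyway.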


## 4. Data cleaning round 2!

The big guns are coming out: lemmatisation, tokenization, stopword removal.

### 4.1 Using lemmatization to reduce words to their dictionary form (lemma)

In [None]:
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = stopwords.words('english')


In [None]:
# lemmatisation

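# Load the spaCy model once; reloading it on every call is very slow
nlp = spacy.load('en_core_web_sm')

def lemmatizer(text):
    '''Return a one-element list holding the lemmatised text as a single string.'''
    tokens = nlp(text)
    text_out = [" ".join(token.lemma_ for token in tokens)]
    
    # spaCy v2 emits '-PRON-' as the lemma of pronouns; replace it with 'i'
    text_out = [re.sub('-PRON-', 'i', doc) for doc in text_out]
    
    return text_out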


In [None]:
test_lem = lemmatizer(test_round1)
test_lem
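
Loading a language model is typically far more expensive than applying it, so a function like `lemmatizer` benefits from the load happening only once across many documents. A minimal stdlib sketch of one way to guarantee that, where `load_model` is a hypothetical stand-in for an expensive loader such as `spacy.load` and `lemmatize_stub` is a placeholder that does no real lemmatisation:

```python
from functools import lru_cache

load_calls = 0  # counts how often the (hypothetical) expensive load really runs

@lru_cache(maxsize=None)
def load_model(name):
    """Hypothetical stand-in for an expensive loader such as spacy.load."""
    global load_calls
    load_calls += 1
    return {"name": name}

def lemmatize_stub(text):
    model = load_model("en_core_web_sm")  # served from the cache after the first call
    return text.split()  # placeholder only, not real lemmatisation

for doc in ["first text", "second text", "third text"]:
    lemmatize_stub(doc)
```

After the loop, `load_calls` is 1 even though the function ran three times.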


### 4.2 Tokenizing and removing stop words

In [None]:
import gensim
from gensim.utils import simple_preprocess


In [None]:
def remove_stopwords(texts):
    out = [[word for word in simple_preprocess(str(doc))
            if word not in stop_words]
            for doc in texts]
    return out


In [None]:
test_no_stp = remove_stopwords(test_lem)
test_no_stp
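
To make the two steps inside `remove_stopwords` easier to see, here is a stdlib-only sketch. `simple_tokenize` is a hypothetical, rough approximation of gensim's `simple_preprocess` (lowercase, alphabetic tokens of 2 to 15 characters), and `demo_stop_words` is a tiny made-up stand-in for the NLTK list:

```python
import re

demo_stop_words = {'the', 'is', 'on', 'to', 'a'}  # made-up subset, not the NLTK list

def simple_tokenize(doc, min_len=2, max_len=15):
    # Rough approximation: lowercase, keep alphabetic runs within the length bounds
    tokens = re.findall(r'[a-z]+', doc.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def remove_stopwords_sketch(texts):
    # One token list per input document, with stop words filtered out
    return [[w for w in simple_tokenize(doc) if w not in demo_stop_words]
            for doc in texts]

result = remove_stopwords_sketch(['the us is back on track to a treaty'])
```

The input is a list of documents and the output is a list of token lists, which is why the lemmatiser above wraps its single string in a list.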


## 5. Data cleaning round 3!

Stemming and bigrams

### 5.1 Stemming with nltk

In [None]:
from nltk.stem.porter import PorterStemmer
ps  = PorterStemmer()


In [None]:
def stemmer(content):
    ps  = PorterStemmer()

    stemmed = [ps.stem(w) for w in content]
    
    return stemmed


In [None]:
test_stem = stemmer(test_no_stp[0])
test_stem


### 5.2 Creating combined NLP cleaner function

In [None]:
# Combines all cleaning steps for a single text.
# Run it on the df first to see how the cleaning has transformed the content.

def nlp_cleaner(content):
    text = cleaning_round1(content)
    text = lemmatizer(text)         # one-element list holding the lemmatised string
    text = remove_stopwords(text)   # list with one list of tokens
    text = stemmer(text[0])         # flat list of stemmed tokens

    return text


In [None]:
test_clean = nlp_cleaner(test)
test_clean


### 5.3 Bigrams

This step becomes part of the get_corpus function.

In [None]:
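# building bigram models

def bigrams(words, bi_min=15, tri_min=10):
    # words: list of tokenised documents (a list of lists of str)
    # tri_min is currently unused, since no trigram model is built
    bigram = gensim.models.Phrases(words, min_count = bi_min)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    
    return bigram_mod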


In [42]:
# won't create bigrams for a single text, but this is to test the output
# Phrases expects a list of tokenised documents, so wrap the single text in a list

test_bigram = bigrams([test_clean])
print(test_bigram[test_clean])


['us', 'presid', 'elect', 'joe', 'biden', 'vow', 'rejoin', 'pari', 'agreement', 'first', 'day', 'offic', 'januari', 'us', 'back', 'track', 'signatori', 'unit', 'nation', 'framework', 'convent', 'climat', 'chang', 'ratifi', 'histor', 'treati']
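
No bigrams appear in this output because no adjacent pair of tokens occurs often enough in a single short text to reach the minimum count. The sketch below illustrates only that count filter with the standard library; gensim's `Phrases` additionally applies a scoring threshold, which is not modelled here. The function `frequent_pairs` and the token list are made up for illustration:

```python
from collections import Counter

def frequent_pairs(docs, min_count):
    # Count adjacent token pairs within each document, keep those seen often enough
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return sorted(pair for pair, n in counts.items() if n >= min_count)

docs = [['pari', 'agreement', 'first', 'day', 'pari', 'agreement']]  # made-up tokens
low = frequent_pairs(docs, min_count=2)    # a low threshold keeps the repeated pair
high = frequent_pairs(docs, min_count=15)  # the notebook's default keeps nothing here
```

On the full dataframe, recurring pairs have a much better chance of reaching the threshold.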


## 6. Get cleaned df
Before creating the corpus, it helps to have a cleaned df to inspect. Bigrams are not applied at this stage; they are added in get_corpus.

In [None]:
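def clean_df(df):
    # work on a copy so the raw content in the original df stays available for get_corpus
    df = df.copy()

    df['content'] = df['content'].apply(nlp_cleaner)
    df['content'] = df['content'].apply(lambda x:' '.join(x))
    
    return df 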


In [43]:
df_clean = clean_df(df_subset)
df_clean


Unnamed: 0,url,content
0,https://plana.earth/academy/how-sustainable-is...,christma around corner unfortun year mani abl ...
1,https://plana.earth/academy/how-sustainable-is...,start statist carbon footprint cau end year pa...
2,https://plana.earth/academy/how-sustainable-is...,time offic christma parti quiz
3,https://plana.earth/academy/how-sustainable-is...,find sustain christma parti would
4,https://plana.earth/academy/how-sustainable-is...,may merri forc
5,https://plana.earth/academy/how-joe-biden-u-s-...,fifth anniversari pari climat agreement time g...
6,https://plana.earth/academy/how-joe-biden-u-s-...,presid elect joe biden vow rejoin pari agreeme...
7,https://plana.earth/academy/how-joe-biden-u-s-...,follow unit state withdraw pari climat agreeme...
8,https://plana.earth/academy/how-joe-biden-u-s-...,today trump administr offici leav pari climat ...
9,https://plana.earth/academy/how-joe-biden-u-s-...,today trump administr offici leav pari climat ...


## 7. Get the corpus

Combining all data-cleaning steps to create corpus and BoW for LDA model

In [None]:
def get_corpus(df):
    
    # clean and tokenise every document
    words = [nlp_cleaner(doc) for doc in df.content]
    
    # train the bigram model on all documents, then apply it to each one
    bigram_mod = bigrams(words)
    bigram_set = [bigram_mod[article] for article in words]
    
    # map each unique token to an integer id
    id2word = gensim.corpora.Dictionary(bigram_set)
    id2word.compactify()
    
    # bag of words: one list of (token_id, count) pairs per document
    corpus = [id2word.doc2bow(text) for text in bigram_set]

    return corpus, id2word, bigram_set


In [None]:
corpus, id2word, train_bigram = get_corpus(df_subset)


In [None]:
# check if texts are clean

train_bigram
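
For readers new to the bag-of-words format, the following stdlib sketch shows the shape of what `get_corpus` returns in `corpus`: one list of `(token_id, count)` pairs per document. The helpers `build_vocab` and `to_bow` are hypothetical, the tokens are made up, and the id numbering here (order of first appearance) is an assumption that may differ from the ids gensim assigns:

```python
from collections import Counter

def build_vocab(docs):
    # Assign ids in order of first appearance (gensim may number tokens differently)
    vocab = {}
    for tokens in docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    # Sorted (token_id, count) pairs for the tokens that are in the vocabulary
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [['pari', 'climat', 'pari'], ['climat', 'chang']]  # made-up tokens
vocab = build_vocab(docs)
bow = [to_bow(d, vocab) for d in docs]
```

Word order is lost in this representation, which is why bigrams are merged into single tokens before the dictionary is built.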


## 8. Transforming success_urls for LDA model

Using success_urls as it was previously tested. It already went through some cleaning steps.


In [None]:
with open('../04_Data/success_g13.pkl', 'rb') as file:
    success_g13 = pickle.load(file)
    
success_g13.head()

In [None]:
# bigrams are not part of the clean_df process

# success_clean = clean_df(success_urls)
# success_clean

In [None]:
# Takes 10 minutes to run

succ_corpus, succ_id2word, succ_train_bigram = get_corpus(success_g13)

In [None]:
succ_train_bigram

## 9. Running full df through nlp_cleaner

In [45]:
df.head()

Unnamed: 0,url,title,published,content,tag
0,https://plana.earth/academy/how-sustainable-is...,How sustainable is your office Christmas party?,2020-12-18,Christmas is just around the corner! Unfortuna...,p
1,https://plana.earth/academy/how-sustainable-is...,How sustainable is your office Christmas party?,2020-12-18,"Before we start, here are a few statistics on ...",p
2,https://plana.earth/academy/how-sustainable-is...,How sustainable is your office Christmas party?,2020-12-18,It is time for the office Christmas Party Quiz!,h2
3,https://plana.earth/academy/how-sustainable-is...,How sustainable is your office Christmas party?,2020-12-18,Find out how sustainable your Christmas Party ...,p
4,https://plana.earth/academy/how-sustainable-is...,How sustainable is your office Christmas party?,2020-12-18,May the Merry Force be with you!,h2


In [47]:
# Takes about an hour to run

# full_academy_corpus, full_academy_id2word, full_academy_train_bigram = get_corpus(df)