# Dummy Movie Review Dataset to demonstrate main NLP concepts to get you started with text analysis using Python

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
#reading in the data

df = pd.read_excel('descricao_e_cae.xlsx')
#movie_df = pd.read_csv('movies.csv')

In [3]:
df

Unnamed: 0,nproj,cae5
0,1543,24540 - Fundição de outros metais não ferrosos
1,1543,24540 - Fundição de outros metais não ferrosos
2,1550,20160 - Fabricação de matérias plásticas sob f...
3,1550,20160 - Fabricação de matérias plásticas sob f...
4,1561,33200 - Instalação de máquinas e de equipament...
...,...,...
5946,48467,
5947,48467,
5948,48467,
5949,48467,


Oh no: when the reviews were copy-pasted into a CSV file, the individual paragraphs of each review ended up in different columns. It also looks like we have duplicate rows.

## Removing Duplicates

In [4]:
#let's remove duplicates first
df['dup'] = df.duplicated(subset=None, keep='first')

In [5]:
df.head()

Unnamed: 0,nproj,cae5,dup
0,1543,24540 - Fundição de outros metais não ferrosos,False
1,1543,24540 - Fundição de outros metais não ferrosos,True
2,1550,20160 - Fabricação de matérias plásticas sob f...,False
3,1550,20160 - Fabricação de matérias plásticas sob f...,True
4,1561,33200 - Instalação de máquinas e de equipament...,False


We have created a new column that stores a boolean flag indicating whether each row duplicates an earlier row. In the output above, the second occurrence of a repeated row (for example, the row with index 1) has the value True in that column. We want to delete the rows where dup == True.

In [6]:
df = df[~df['dup']].copy()  # keep only rows not flagged as duplicates; copy() avoids chained-assignment issues later

In [7]:
df

Unnamed: 0,nproj,cae5,dup
0,1543,24540 - Fundição de outros metais não ferrosos,False
2,1550,20160 - Fabricação de matérias plásticas sob f...,False
4,1561,33200 - Instalação de máquinas e de equipament...,False
6,1563,72110 - Investigação e desenvolvimento em biot...,False
10,1567,20600 - Fabricação de fibras sintéticas ou art...,False
...,...,...,...
5933,49277,,False
5937,49256,,False
5939,48626,,False
5943,48566,,False


In [8]:
del df['dup'] # deleting "dup" column since we don't need it anymore
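
The flag, filter and delete steps above can also be collapsed into a single call. Here is a minimal sketch on a hypothetical toy frame (the values are made up for illustration; only the column names are borrowed from the data above):

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row
toy = pd.DataFrame({
    "nproj": [1543, 1543, 1550],
    "cae5": ["24540 - A", "24540 - A", "20160 - B"],
})

# duplicated() flags every repeat of an earlier row (keep="first")
flags = toy.duplicated(keep="first")

# drop_duplicates() removes those flagged rows in one step
deduped = toy.drop_duplicates(keep="first")

print(flags.tolist())          # [False, True, False]
print(deduped.index.tolist())  # [0, 2]
```

Note that the original index labels are kept, which is why the deduplicated dataframe above jumps from 0 to 2 to 4.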

## Merging Columns

Let's use some code we found on Stack Overflow to merge the 3 review columns into one.

In [9]:

df['full_review'] = df[df.columns[2:]].apply(
    lambda x: ' '.join(x.dropna().astype(str)),
    axis=1
)



### List slicing and lambda functions in Python
OK, a lot is going on in the code above, so let's break it down:
1. We create a new column (full_review) that will contain the whole review.
2. We select all columns from index 2 to the end. That is what [2:] does: it starts at the third column (indexing starts at 0) and, because no end index is given, runs through to the last column.
3. We apply a lambda function to each row (x) of the selected columns. A lambda function is a small anonymous function that comes in handy when we need an operation only once and do not want to define a separate named function for it.
4. The function takes the cells of the row, starting at the third column, and joins them into one string separated by spaces. Empty cells are dropped first (dropna()); otherwise "nan" would be added to the review text. We explicitly cast the cells to string (astype(str)) because some entries might have no text at all (all NaNs), in which case Python would treat them as floats.
5. Because of axis=1, the function goes through each row in the dataframe and performs the merge described in 4.
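
To see these steps in isolation, here is a small sketch on a hypothetical toy frame. The column names `id`, `title` and `part_1` to `part_3` are assumptions made up for this illustration; they are not columns of the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a review split across three columns, with gaps
toy = pd.DataFrame({
    "id": [1, 2],
    "title": ["A", "B"],
    "part_1": ["Great plot.", "Too long."],
    "part_2": ["Strong cast.", np.nan],
    "part_3": [np.nan, np.nan],  # all NaN, so pandas stores this column as float
})

# Select columns from index 2 onwards and join the non-empty cells of each row
toy["full_review"] = toy[toy.columns[2:]].apply(
    lambda x: " ".join(x.dropna().astype(str)),
    axis=1,
)

print(toy["full_review"].tolist())
```

The slice only does something useful if the frame really has text columns from index 2 onwards. If it has fewer than three columns, the slice is empty and there is nothing to join.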


In [10]:
df.head()

Unnamed: 0,nproj,cae5,full_review
0,1543,24540 - Fundição de outros metais não ferrosos,
2,1550,20160 - Fabricação de matérias plásticas sob f...,
4,1561,33200 - Instalação de máquinas e de equipament...,
6,1563,72110 - Investigação e desenvolvimento em biot...,
10,1567,20600 - Fabricação de fibras sintéticas ou art...,


In [11]:
df.iloc[0,2]

''

## Preprocessing

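The merge ran without errors. Note, however, that the cell above returned an empty string: if df.columns[2:] selects no columns, there is nothing to join and full_review stays empty. Let's delete the redundant columns and do some text preprocessing.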

In [12]:
cols = [0, 2]  # positions of the columns we don't need
df.drop(df.columns[cols],axis=1,inplace=True)

In [21]:
df['cae5']

0          24540 - Fundição de outros metais não ferrosos
2       20160 - Fabricação de matérias plásticas sob f...
4       33200 - Instalação de máquinas e de equipament...
6       72110 - Investigação e desenvolvimento em biot...
10      20600 - Fabricação de fibras sintéticas ou art...
                              ...                        
5933                                                  NaN
5937                                                  NaN
5939                                                  NaN
5943                                                  NaN
5945                                                  NaN
Name: cae5, Length: 1593, dtype: object

In [22]:
import re
from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer
stop_words_file = 'SmartStoplist.txt'

stop_words = []

with open(stop_words_file, "r") as f:
    for line in f:
        stop_words.extend(line.split()) 
        
stop_words = set(stop_words)  # a set makes the "word not in stop_words" check fast

def preprocess(raw_text):
    
    #regular expression keeping only letters 
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and split into words -> convert string into list ( 'hello world' -> ['hello', 'world'])
    words = letters_only_text.lower().split()

    # PorterStemmer is an English stemmer; plug in any other stemmer or lemmatiser you want to try out
    stemmer = PorterStemmer()

    # remove stop words
    cleaned_words = [word for word in words if word not in stop_words]

    # stem the remaining words (call .lemmatize() instead of .stem() if you use a lemmatiser)
    stemmed_words = [stemmer.stem(word) for word in cleaned_words]
    
    # converting list back to string
    return " ".join(stemmed_words)
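
One caveat about the letters-only regular expression in `preprocess`: the pattern `[^a-zA-Z]` keeps ASCII letters only, so accented characters are replaced by spaces and words get split. The sketch below compares it with a Unicode-aware pattern. The helper name `letters_only` is hypothetical, and the sample string is taken from the `cae5` column shown above.

```python
import re

def letters_only(raw_text):
    """Replace everything that is not a letter with a space (hypothetical helper)."""
    # [\W\d_] matches non-word characters, digits and underscores,
    # so only letters survive, including accented ones such as "ç" or "ã".
    return re.sub(r"[\W\d_]+", " ", raw_text)

sample = "24540 - Fundição de outros metais não ferrosos"

# Pattern used in preprocess(): accented letters are treated as non-letters
ascii_only = re.sub("[^a-zA-Z]", " ", sample).lower().split()

# Unicode-aware variant: accented words stay intact
unicode_aware = letters_only(sample).lower().split()

print(ascii_only)     # ['fundi', 'o', 'de', 'outros', 'metais', 'n', 'o', 'ferrosos']
print(unicode_aware)  # ['fundição', 'de', 'outros', 'metais', 'não', 'ferrosos']
```

For non-English text, the stop word list and the stemmer would probably need swapping as well. NLTK's SnowballStemmer, for instance, supports "portuguese" among its languages.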

In [23]:
test_sentence = "this is a sentence to demonstrate how the preprocessing function works...!"

preprocess(test_sentence)

'sentenc demonstr preprocess function work'

You can see that "sentence" was stemmed to "sentenc", and that all stop words and punctuation were removed.
Let's apply that function to the text column of our dataframe.

In [26]:
df['prep'] = df['cae5'].apply(preprocess)  


TypeError: expected string or bytes-like object
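
The TypeError above is most likely caused by the missing values in `cae5`: a NaN cell is a float, and `re.sub` only accepts strings. One possible workaround, sketched here with a hypothetical helper `to_text`, is to turn every non-string cell into an empty string before preprocessing.

```python
def to_text(value):
    """Return value unchanged if it is a string, else an empty string (hypothetical helper)."""
    return value if isinstance(value, str) else ""

# Stand-in for the real column: two strings and one missing value (NaN is a float)
cells = ["24540 - Fundição", float("nan"), "20160 - Fabricação"]

cleaned = [to_text(c) for c in cells]
print(cleaned)  # ['24540 - Fundição', '', '20160 - Fabricação']
```

In the notebook this could be used as `df['cae5'].apply(lambda v: preprocess(to_text(v)))`. Replacing the missing values first with `df['cae5'].fillna('')` should have the same effect.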

In [27]:
df.head()

Unnamed: 0,cae5
0,24540 - Fundição de outros metais não ferrosos
2,20160 - Fabricação de matérias plásticas sob f...
4,33200 - Instalação de máquinas e de equipament...
6,72110 - Investigação e desenvolvimento em biot...
10,20600 - Fabricação de fibras sintéticas ou art...


## Most Common Words
To get an idea of what a dataset contains, it is useful to look at its most common words. Reading through all the texts is cumbersome and inefficient, so let's extract the most common keywords.


In [28]:
from collections import Counter
Counter(" ".join(df["prep"]).split()).most_common(10)

KeyError: 'prep'
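
The KeyError above follows from the earlier failure: the cell that should have created the `prep` column raised a TypeError, so the column does not exist. Once `prep` is available, the counting itself is a one-liner. Here is a minimal sketch on made-up preprocessed texts (the list `prep` is a hypothetical stand-in for `df["prep"]`):

```python
from collections import Counter

# Hypothetical preprocessed texts standing in for df["prep"]
prep = ["fundi metai", "fundi metai ferroso", "fundi"]

# Join all texts into one string, split it into words, and count them
word_counts = Counter(" ".join(prep).split())

print(word_counts.most_common(2))  # [('fundi', 3), ('metai', 2)]
```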

In [None]:
# wordcloud is a handy library for producing word clouds
from wordcloud import WordCloud

import matplotlib.pyplot as plt
# if using a Jupyter notebook, include:
%matplotlib inline

# join all preprocessed texts into one string, to extract the most common words
all_words = " ".join(df["prep"]) + " "

wordcloud = WordCloud(width = 700, height = 700, 
                background_color ='white', 
                min_font_size = 10).generate(all_words) 
  
# plot the WordCloud image                        
plt.figure(figsize = (5, 5), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show()

In [None]:
from nltk.util import ngrams
n_gram = 2
n_gram_dic = dict(Counter(ngrams(all_words.split(), n_gram)))

for i in n_gram_dic:
    if n_gram_dic[i] >= 2:
        print(i, n_gram_dic[i])
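
For bigrams specifically, the same counts can be obtained with the standard library alone by pairing each token with its successor. This is a sketch on a made-up string (`all_words` here is a hypothetical stand-in for the variable built above):

```python
from collections import Counter

# Hypothetical word string standing in for all_words
all_words = "fundi metai ferroso fundi metai fabrica materia "

tokens = all_words.split()

# Pair each token with its successor to form bigrams, then count them
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Print only the bigrams that occur at least twice
for bigram, count in bigram_counts.items():
    if count >= 2:
        print(bigram, count)  # ('fundi', 'metai') 2
```

Because all texts are joined into one long string first, some bigrams span the boundary between two texts. With many short texts it may be worth counting n-grams per text instead.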
    

## That's it (for now)

Given that there are only 5 movie reviews, there is of course not much else of interest to do with the tools covered in the Medium article. However, I hope I have covered enough to get you started. Feel free to check out the US Railroad incident notebook in the same GitHub repository as this one, and to copy and reuse the preprocessing function and any other code you find useful.