# Lab NLP


## Challenge 1 - Installations

In [1]:
text = '''Ironhack is a Global Tech School ranked num 2 worldwide.
        Our mission is to help people transform their careers and join a thriving 
        community of tech professionals that love what they do. This ideology is reflected 
        in our teaching practices, which consist of a nine-weeks immersive programming, UX/UI 
        design or Data Analytics course as well as a one-week hiring fair aimed at helping our 
        students change their career and get a job straight after the course. 
        We are present in 8 countries and have campuses in 9 locations - Madrid, Barcelona, Miami, 
        Paris, Mexico City,  Berlin, Amsterdam, Sao Paulo and Lisbon.'''

In [2]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/rodrigo/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/rodrigo/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/rodrigo/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /home/rodrigo/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /home/rodrigo/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to

True

In [3]:
from nltk.corpus import brown

In [4]:
brown.words()[0:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [5]:
from nltk import sent_tokenize, word_tokenize

In [6]:
text = 'Ironhack is a Global Tech School ranked num 2 worldwide. Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do. This ideology is reflected in our teaching practices, which consist of a nine-weeks immersive programming, UX/UI design or Data Analytics course as well as a one-week hiring fair aimed at helping our students change their career and get a job straight after the course. We are present in 8 countries and have campuses in 9 locations - Madrid, Barcelona, Miami, Paris, Mexico City,  Berlin, Amsterdam, Sao Paulo and Lisbon.'

In [7]:
sent_tokenize(text)

['Ironhack is a Global Tech School ranked num 2 worldwide.',
 'Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do.',
 'This ideology is reflected in our teaching practices, which consist of a nine-weeks immersive programming, UX/UI design or Data Analytics course as well as a one-week hiring fair aimed at helping our students change their career and get a job straight after the course.',
 'We are present in 8 countries and have campuses in 9 locations - Madrid, Barcelona, Miami, Paris, Mexico City,  Berlin, Amsterdam, Sao Paulo and Lisbon.']

In [8]:
word_tokenize(text)[:10]

['Ironhack',
 'is',
 'a',
 'Global',
 'Tech',
 'School',
 'ranked',
 'num',
 '2',
 'worldwide']

## Challenge 2 - Preparing Text Data For Analysis

In [9]:
import re

In [10]:
def clean_up(s):
    """
    Cleans up numbers, URLs, and special characters from a string.

    Args:
        s: The string to be cleaned up.

    Returns:
        A string that has been cleaned up.
    """
    s = s.lower().strip()
    # Remove URLs: match "http" plus everything up to the next whitespace (or end of string).
    s = re.sub(r'http\S+', ' ', s)
    return " ".join(re.findall('[a-z]+', s))

In [11]:
s = clean_up("@Ironhack's-#Q website 776-is http://ironhack.com [(2018)]")

## Tokenization

In [12]:
def tokenize(s):
    """
    Tokenize a string.

    Args:
        s: String to be tokenized.

    Returns:
        A list of words as the result of tokenization.
    """
    return word_tokenize(s)

In [13]:
s = tokenize(s)

## Stemming and Lemmatization

NLTK provides several stemmers; the three most commonly compared are Porter, Snowball, and Lancaster. They differ in how aggressively they stem. Porter is the gentlest and tends to preserve a word's original form when in doubt. Lancaster is the most aggressive and sometimes produces wrong outputs. Snowball sits in between. **In most cases you will use either Porter or Snowball**.
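To see the difference in aggressiveness, the three stemmers can be run side by side. This is a minimal sketch: the sample words are hypothetical and chosen only for illustration, and the output is left for you to inspect rather than stated here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer; none of them needs extra NLTK data downloads.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Hypothetical sample words, chosen only for illustration.
sample_words = ["running", "generously", "maximum", "careers"]

# Print the stem produced by each stemmer for every sample word.
for word in sample_words:
    stems = {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    print(word, stems)
```

Comparing the printed stems word by word is usually enough to decide which stemmer suits a given corpus.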


In [14]:
from nltk.stem import WordNetLemmatizer, PorterStemmer
lemmatizer = WordNetLemmatizer()
porterstemmer = PorterStemmer()

In [15]:
s

['ironhack', 's', 'q', 'website', 'is']

In [18]:
def stem_and_lemmatize(l):
    """
    Perform stemming and lemmatization on a list of words.

    Args:
        l: A list of strings.

    Returns:
        A list of strings after being stemmed and lemmatized.
    """
    result=[]
    for e in l:
        result.append(lemmatizer.lemmatize(porterstemmer.stem(e)))

    return result

In [19]:
s = stem_and_lemmatize(s)


## Stop Words Removal

Stop words are the most commonly used words in a language that don't contribute to the main meaning of a text. Examples of English stop words are i, me, is, and, the, but, and here. We want to remove stop words from the analysis because otherwise they will make up an overwhelming portion of our tokenized word list, and NLP algorithms will have trouble identifying the truly important words.

NLTK has a stopwords package that allows us to import the most common stop words in over a dozen languages including English, Spanish, French, German, Dutch, Portuguese, Italian, etc. These are the bare minimum stop words (100-150 words in each language) that can get beginners started. Some other NLP packages such as stop-words and wordcloud provide bigger lists of stop words.

Now in your Jupyter Notebook, create a function called remove_stopwords that loops through a list of words that have been stemmed and lemmatized and removes the stop words. Return a new list where stop words have been removed.


In [24]:
# Note: this solution uses spaCy's built-in English stop word list rather than NLTK's.
from spacy.lang.en.stop_words import STOP_WORDS

In [25]:
def remove_stopwords(l):
    """
    Remove English stopwords from a list of strings.

    Args:
        l: A list of strings.

    Returns:
        A list of strings after stop words are removed.
    """
    result=[]
    for e in l:
        if e not in STOP_WORDS:
            result.append(e)
    return result

In [27]:
s = remove_stopwords(s)

## Challenge 3: Sentiment Analysis


## Creating Bag of Words

The purpose of this step is to create a bag of words from the processed data. The bag of words contains all the unique words in your whole text body (a.k.a. corpus) along with the number of occurrences of each word. It will allow you to understand which words are the most important features across the whole corpus.

As you can imagine, you will end up with a massive set of words. The less important words (i.e. those with very few occurrences) do not contribute much to the sentiment. Therefore, you only need to use the most important words to build your feature set in the next step. In our case, we will use the top 5,000 words with the highest frequency to build the features.

In your Jupyter Notebook, combine all the words in text_processed and calculate the frequency distribution of all words. A convenient tool for calculating the term frequency distribution is NLTK's FreqDist class. Then select the top 5,000 words from the frequency distribution.
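The step above can be sketched as follows. The small `text_processed` list here is a hypothetical stand-in, assuming that the real one holds one list of cleaned tokens per document; `all_words` and `top_words` are illustrative names.

```python
from nltk import FreqDist

# Hypothetical stand-in for `text_processed`: one list of cleaned tokens per document.
text_processed = [
    ["love", "thi", "phone"],
    ["hate", "thi", "phone"],
    ["love", "love", "camera"],
]

# Flatten all documents into a single list of tokens and count each word.
all_words = [word for tokens in text_processed for word in tokens]
freq_dist = FreqDist(all_words)

# Keep the 5,000 most frequent words (fewer if the vocabulary is smaller).
top_words = [word for word, _count in freq_dist.most_common(5000)]
```

`FreqDist` behaves like a `collections.Counter`, so `freq_dist["love"]` returns the count of a single word and `most_common(n)` returns `(word, count)` pairs sorted by descending frequency.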



## Testing Naïve Bayes Model

Now we'll test our classifier with the test dataset. This is done by calling nltk.classify.accuracy(classifier, test).

As mentioned in one of the tutorial videos, a Naive Bayes model is considered OK if your accuracy score is over 0.6. If your accuracy score is over 0.7, you've done a great job!
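As a rough sketch of the train-and-evaluate step, the example below uses a tiny hand-made dataset. The data, the `find_features` helper, and the train/test split are all assumptions made for illustration; your own feature-building code and dataset take their place.

```python
import nltk

# Hypothetical toy data: (tokens, label) pairs standing in for the processed tweets.
labelled = [
    (["love", "great", "phone"], True),
    (["great", "camera"], True),
    (["love", "screen"], True),
    (["hate", "bad", "phone"], False),
    (["bad", "battery"], False),
    (["hate", "screen"], False),
]
# Hypothetical stand-in for the top words selected from the bag of words.
top_words = ["love", "great", "hate", "bad", "phone", "screen"]


def find_features(tokens):
    """Map a token list to a {word: is_present} dict over the top words."""
    present = set(tokens)
    return {word: (word in present) for word in top_words}


feature_sets = [(find_features(tokens), label) for tokens, label in labelled]

# Hold out one example of each class for testing.
train = feature_sets[:2] + feature_sets[3:5]
test = [feature_sets[2], feature_sets[5]]

classifier = nltk.NaiveBayesClassifier.train(train)
accuracy = nltk.classify.accuracy(classifier, test)
print(f"accuracy: {accuracy:.2f}")
```

With a dataset this small the score says nothing about real performance; the point is only the shape of the `train` and `test` lists that `NaiveBayesClassifier.train` and `nltk.classify.accuracy` expect.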



## Bonus Questions 1 & 2: Improve Model Performance & Machine Learning Pipeline

If you are still not exhausted so far and want to dig deeper, try to improve your classifier performance. There are many aspects you can dig into, for example:

Improve stemming and lemmatization. Inspect your bag of words and the most important features. Are there any words you should further remove from the analysis? You can append these words to the stop words list.

Remember that we used only the top 5,000 features to build the model? Try using different numbers of top features. The goal is to use as few features as you can without compromising model performance. The fewer features you select, the faster your model trains, and you can then use a larger sample size to improve your accuracy score.

In a new Jupyter Notebook, combine all your code into a function (or a class). The function should run the complete machine learning pipeline: it receives the dataset location and returns the trained classifier. **This will allow you to use your function to predict the sentiment of any tweet in real time**.
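One possible shape for such a pipeline is sketched below. It rests on several assumptions: the dataset is a CSV file with hypothetical column names `text` and `target`; stop word removal and lemmatization are omitted to keep the sketch short and free of data downloads; and `preprocess`, `build_classifier`, and `predict_sentiment` are illustrative names, not part of the lab.

```python
import csv
import random
import re

import nltk
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, drop URLs and non-letter characters, then stem each word."""
    text = re.sub(r"http\S+", " ", text.lower())
    return [_stemmer.stem(word) for word in re.findall(r"[a-z]+", text)]


def build_classifier(csv_path, text_col="text", label_col="target",
                     n_features=5000, test_ratio=0.2, seed=0):
    """Train Naive Bayes on a CSV file; return (classifier, featurize, accuracy)."""
    # Column names are assumptions; adjust them to match your dataset.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [(preprocess(row[text_col]), row[label_col])
                for row in csv.DictReader(handle)]
    # Shuffle so the split is not biased by the file's ordering.
    random.Random(seed).shuffle(rows)

    # Bag of words: keep the n_features most frequent stems.
    freq_dist = nltk.FreqDist(word for tokens, _ in rows for word in tokens)
    top_words = [word for word, _ in freq_dist.most_common(n_features)]

    def featurize(tokens):
        present = set(tokens)
        return {word: (word in present) for word in top_words}

    feature_sets = [(featurize(tokens), label) for tokens, label in rows]
    split = int(len(feature_sets) * (1 - test_ratio))
    train, test = feature_sets[:split], feature_sets[split:]

    classifier = nltk.NaiveBayesClassifier.train(train)
    accuracy = nltk.classify.accuracy(classifier, test) if test else None
    return classifier, featurize, accuracy


def predict_sentiment(classifier, featurize, tweet):
    """Classify a raw tweet with a classifier returned by build_classifier."""
    return classifier.classify(featurize(preprocess(tweet)))
```

Returning the `featurize` function together with the classifier keeps the selected top words attached to the model, so a new tweet is always encoded with the same features the classifier was trained on.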
