<a href="https://colab.research.google.com/github/Natural-Language-Processing-YU/Module-1-Assignment/blob/main/M1_Assignment_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# M1 Assignment: Text Processing and Edit Distance 

In this section, we will explore how to preprocess tweets. We will provide a function for preprocessing tweets during this week's assignment, but it is still good to know what is going on under the hood. By the end of this assignment, you will see how to use the [NLTK](http://www.nltk.org) package to build a preprocessing pipeline for Twitter datasets.

## Setup
Eventually, you will conduct a sentiment analysis on tweets. To help with that, we will be using the Natural Language Toolkit (NLTK) package, an open-source Python library for natural language processing. In this assignment, you will use NLTK to clean tweets so that they are ready for interpretation.

In Part 1.1, you will extract a set of Elon Musk tweets. Next, in Part 1.2, you will process the tweets using the preprocessing tasks outlined in this section and in Chapter 2 of Jurafsky and Martin. Finally, in Part 1.3, you will create a simple version of a Levenshtein distance function to compute the edit distance between two words.

As part of completing the assignment, you will see that there are areas in the notebook for you to complete with your own code.

They will look like the following:
```
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
'Some coding activity for you to complete'
### END CODE HERE ###

```

Additionally, you will be using custom media and libraries created and stored for you in your Colab environment.


## Part 1.1: Using tweets

---



### Extracting tweets
In this next section, we're going to import Elon Musk's tweets and prepare them for preprocessing. Use the following code to connect to the M1 Assignment repo and clone it into your Google Colab environment.



In [5]:
#install this if you do not have this already installed
#!pip install gitpython 

import git
import sys


# Clone the GitHub repository
repo_url = 'https://github.com/Natural-Language-Processing-YU/Module-1-Assignment.git'
repo_dir = '/content/m1_repo'  # Specify the directory to clone the repository

# Add the cloned repository directory to the import path
sys.path.append(repo_dir)

git.Repo.clone_from(repo_url, repo_dir)



<git.repo.base.Repo '/content/m1_repo/.git'>

Use the pandas library to create a dataframe from the CSV file full of tweets.

In [76]:
#extract the tweets
import pandas as pd

tweets = pd.read_csv('/content/m1_repo/data/elonmusk_tweets.csv') #import file
print(tweets.dtypes) #print data types
print(tweets.text) #show tweets from file
df = pd.DataFrame(tweets)



id             int64
created_at    object
text          object
dtype: object
0       b'And so the robots spared humanity ... https:...
1       b"@ForIn2020 @waltmossberg @mims @defcon_5 Exa...
2           b'@waltmossberg @mims @defcon_5 Et tu, Walt?'
3                     b'Stormy weather in Shortville ...'
4       b"@DaveLeeBBC @verge Coal is dying due to nat ...
                              ...                        
2814                 b'That was a total non sequitur btw'
2815    b'Great Voltaire quote, arguably better than T...
2816    b'I made the volume on the Model S http://t.co...
2817    b"Went to Iceland on Sat to ride bumper cars o...
2818    b'Please ignore prior tweets, as that was some...
Name: text, Length: 2819, dtype: object


## Part 1.2: Preprocessing the text from tweets
Text preprocessing is one of the critical steps in an NLP project and in data science and analytics. It consists of cleaning and formatting the data before feeding it to an algorithm. For NLP, preprocessing comprises the following tasks:

1. Tokenizing the string
2. Lowercasing
3. Removing stop words and punctuation
4. Stemming

We will take this approach with one of the tweets returned above and see how it is transformed by each preprocessing step.

Let's take a look at one tweet from our dataframe:

In [68]:
print('\033[92m' + df['text'][2816])

I made the volume on the Model S  go to 11.  Now I just need to work in a miniature Stonehenge...'


### Remove hyperlinks, hashtags, and the beginning of strings
Using this tweet, let's clean it up to remove unnecessary information. We will use regex to remove hyperlinks, the hash sign of hashtags, and the `b'` prefix at the start of the tweet string. You will create the regex patterns for hyperlinks and hashtags; the pattern for the start of the string is provided.

In [70]:
import re  # regular expressions, needed for re.sub below

### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
#create regex for hyperlinks
regex_remove_hyperlinks = None
df.loc[2816, 'text'] = re.sub(regex_remove_hyperlinks, '', df.loc[2816, 'text'])

# remove hashtags
# only removing the hash # sign from the word
regex_hash = None
df.loc[2816, 'text'] = re.sub(regex_hash, '', df.loc[2816, 'text'])


# remove the leading b' or b" (bytes-literal prefix) from the string
regex_string_beginning = r"^b['\"]"
df.loc[2816, 'text'] = re.sub(regex_string_beginning, '', df.loc[2816, 'text'])

### END CODE HERE ###


# Print the modified tweet
print('\033[92m' + df.loc[2816, 'text'])



I made the volume on the Model S  go to 11.  Now I just need to work in a miniature Stonehenge...'
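The provided prefix pattern `^b['\"]` can be tried on its own before applying it to the dataframe. The sketch below runs it on a few illustrative strings shaped like the rows printed earlier (they are examples, not guaranteed to match the CSV exactly). The trailing-quote pattern `regex_string_end` and the helper `strip_bytes_wrapper` are hypothetical additions, not part of the assignment.

```python
import re

# Pattern from the cell above: a leading b followed by a single or double quote.
regex_string_beginning = r"^b['\"]"

# Illustrative strings shaped like the rows printed from the CSV.
samples = [
    "b'Stormy weather in Shortville ...'",
    "b\"@DaveLeeBBC @verge Coal is dying\"",
    "btw, that was a total non sequitur",  # starts with b, but not with b' or b"
]

for raw in samples:
    # The prefix is removed; the closing quote at the end is left in place.
    print(re.sub(regex_string_beginning, "", raw))

# Hypothetical extra pattern (an assumption, not from the assignment):
# match one quote character at the very end of the string.
regex_string_end = r"['\"]$"


def strip_bytes_wrapper(raw):
    """Remove the b'...' or b"..." wrapper if present; otherwise return raw unchanged."""
    without_prefix = re.sub(regex_string_beginning, "", raw)
    if without_prefix != raw:
        # Only strip the closing quote when a prefix was actually removed.
        return re.sub(regex_string_end, "", without_prefix)
    return raw


print(strip_bytes_wrapper(samples[0]))
```

Because the pattern is anchored with `^`, a tweet that merely begins with the letter b is left untouched.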


### Using NLTK libraries

Now let's use the NLTK libraries to remove stopwords, tokenize, and stem the words. 

The Porter stemming algorithm, also known as the Porter stemmer, is a widely used algorithm for stemming words in natural language processing (NLP). It is named after its creator, Martin Porter. The goal of stemming is to reduce words to their base or root form, which helps to normalize variations of a word and reduce the vocabulary size.

In [78]:
import nltk                                # Python library for NLP
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

In [21]:
# download the stopwords from NLTK
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Tokenize the string

To tokenize means to split a string into individual words, discarding blanks and tabs. In this same step, we will also convert each word in the string to lower case. The [tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual) module from NLTK allows us to do this easily:

In [None]:
# instantiate tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)

# tokenize tweets
tokenized_tweet = tokenizer.tokenize(str(df['text'][2816]))


print(tokenized_tweet)

### Remove stop words and punctuation

The next step is to remove stop words and miscellaneous punctuation. Stop words are common words that add little semantic meaning to the tweet. NLTK includes a built-in list of stop words, which is printed when you run the cell below.

In [54]:
#Import the english stop words list from NLTK
stopwords_english = stopwords.words('english') 

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

We can see that the stop words list above contains some words that could be important in some contexts. 
These could be words like _i, not, between, because, won, against_. In some cases, you may want to update this dictionary of stop words to suit your needs. 

Certain groupings like ':)' and '...' are often worth retaining when dealing with tweets because they express emotion, although in other contexts they should be removed.


In [64]:
df2=df
tweets_clean = []

for word in tokenized_tweet: # Go through every word in your tokens list
    if (word not in stopwords_english and  # remove stopwords
        word not in string.punctuation):  # remove punctuation
        tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)

removed stop words and punctuation:
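One detail of the filter above is worth checking by hand: `word not in string.punctuation` is a substring test against a single string of punctuation characters, so multi-character tokens such as '...' and ':)' pass through unless they happen to appear in that string verbatim. The sketch below shows this with an illustrative token list (not real tokenizer output); `is_all_punctuation` is a hypothetical helper for the case where such tokens should be dropped as well.

```python
import string

# Illustrative tokens, roughly what a tweet tokenizer might return.
tokens = ["great", ".", "...", ":)", "!", "()", "news"]

for token in tokens:
    # `in` on a string checks whether token occurs as a substring.
    print(repr(token), token in string.punctuation)


def is_all_punctuation(token):
    """Return True when every character of a non-empty token is punctuation."""
    return bool(token) and all(char in string.punctuation for char in token)


# Stricter filter: drops '...' and ':)' along with single punctuation marks.
kept = [token for token in tokens if not is_all_punctuation(token)]
print(kept)  # ['great', 'news']
```

With the original substring test, '...' and ':)' are kept, which matches the goal of retaining emotive groupings; the stricter helper is only useful when they should be removed.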


Next, we use the Porter stemmer to stem the words that remain after filtering.

In [None]:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tweets_stem = []
for word in tweets_clean:  # stem only the tokens that survived the filtering step
    stem_word = stemmer.stem(word)  
    tweets_stem.append(stem_word)
print(tweets_stem)

## preprocess_tweet()

As shown above, preprocessing consists of multiple steps before you arrive at the final list of words. For this week's assignment, write a function called `preprocess_tweet()`.

Then, use this to iterate through the dataframe and preprocess each tweet into a new column. 

Print your results. 


In [None]:
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

def preprocess_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    None
    return tweets_clean


### END CODE HERE ###

In [None]:
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
"""
Use the function to iterate through the original dataframe 'df' and preprocess each tweet into a new column "preprocessed_tweet". Pseudocode below. 

for i in range(len(df)):
  df["preprocessed_tweet"][i] = preprocess_tweet(df["text"][i])


"""

None
### END CODE HERE ###

display(df)


If Elon Musk gets his way, the people of Texas may soon count Tesla as one of its many electric utility options," s… https://t.co/KiD6rAFxeE
preprocessed tweet:
['elon', 'musk', 'get', 'way', 'peopl', 'texa', 'may', 'soon', 'count', 'tesla', 'one', 'mani', 'electr', 'util', 'option', '…']


## Part 1.3: Create a Levenshtein Distance Function

Recall that edit distance measures how dissimilar two words are, expressed as the minimum number of edit operations (insertions, deletions, and substitutions) needed to turn one word into the other. Levenshtein distance is one of the most common algorithms for calculating the edit distance between two words.

Create your own simple Levenshtein distance function. Then return the distance between two words: _stemming_ and _lemmatization_.


In [None]:
def leven_dist(string1, string2):
    '''
    input: 
        string1 = the first word in your formula
        string2 = the second word in your formula
    output: 
        Levenshtein edit distance
    
    '''
    # base case: if min(i, j) == 0, the distance is the length of the other string
    if not string1: return len(string2)
    if not string2: return len(string1)
    
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # otherwise min(i, j) != 0, so take the minimum of the three recursive cases
    
    return min(
        
        # part I: distance for prefixes (i-1, j), plus 1 for a deletion
        leven_dist(None, None)+1, 
        

        # part II: distance for prefixes (i, j-1), plus 1 for an insertion
        leven_dist(None, None)+1,
        
        # part III: distance for prefixes (i-1, j-1), plus 1 only if the last letters differ
        leven_dist(None, None)+None
    )
### END CODE HERE ###



#now run your results

string1 = 'stemming'
string2 = 'lemmatization'
print("Your Levenshtein Distance is:", leven_dist(string1, string2))

### Expected output:
Your Levenshtein Distance is: 10
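
As a way to cross-check the expected value, the sketch below computes the same distance with an iterative dynamic-programming table instead of recursion. It is a different technique from the recursive function in the exercise, and the name `levenshtein_table` and its `substitution_cost` parameter are assumptions made for illustration.

```python
def levenshtein_table(source, target, substitution_cost=1):
    """Edit distance with cost 1 per insertion or deletion.

    Iterative version that keeps only two rows of the distance table.
    """
    # previous[j] = distance between the first i-1 characters of source
    # and the first j characters of target (row 0: j insertions).
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            change = 0 if s_char == t_char else substitution_cost
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + change,  # match or substitute
            ))
        previous = current
    return previous[-1]


print(levenshtein_table("stemming", "lemmatization"))  # 10
print(levenshtein_table("kitten", "sitting"))          # 3
```

The table version does a fixed amount of work per pair of character positions, whereas the plain recursion revisits the same prefixes many times. Setting `substitution_cost=2`, a variant also discussed in Chapter 2 of Jurafsky and Martin, gives 11 for the same pair of words.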

#end of assignment#
                                            
Sources: 
Natural Language preprocessing, deeplearning.ai
Twitter API documentation
Wikipedia: Levenshtein distance
Chapter 2 (Jurafsky and Martin)
                        
