# Assignments

In this assignment, you'll clean up two datasets. You'll use them in later checkpoints of this module, so cleaning them up now will save you time later.

The first dataset is a dialogue dataset called the Cornell Movie--Dialogs Corpus. It contains conversations between characters from more than 600 movies.

The second dataset is the Twitter US Airline Sentiment dataset from Kaggle. It contains travelers' tweets about several airlines from February 2015. The dataset is usually used for sentiment analysis, but we'll use it for sentence generation later on.

Because these datasets have relatively large memory requirements, we recommend using Google Colaboratory.

Please submit your solutions to the following tasks as a link to your Jupyter notebook on GitHub.


Submit your work below, and plan on discussing it with your mentor. You can also take a look at this example solution.

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import nltk
import spacy
import re

import warnings
warnings.filterwarnings(action="ignore")

#!python -m spacy download en

In [6]:
# The data is in the table called "dialogs".
# Apply the data preprocessing techniques you learned here to the Cornell Movie--Dialogs
# Corpus data. You'll be using this dataset when developing a chatbot in a later checkpoint.

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'cornell_movie_dialogs'

In [7]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

dialogs_df = pd.read_sql_query('select * from dialogs',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


dialogs_df.head(10)

Unnamed: 0,index,dialogs
0,0,Can we make this quick? Roxanne Korrine and A...
1,1,"Well, I thought we'd start with pronunciation,..."
2,2,Not the hacking and gagging and spitting part....
3,3,Okay... then how 'bout we try out some French ...
4,4,You're asking me out. That's so cute. What's ...
5,5,Forget it.
6,6,"No, no, it's my fault -- we didn't have a prop..."
7,7,Cameron.
8,8,"The thing is, Cameron -- I'm at the mercy of a..."
9,9,Seems like she could get a date easy enough...


In [9]:
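# Note: When parsing the data using spaCy, you may run into memory issues even
# in Google Colaboratory. Disabling the parser and NER components, which this
# cleanup doesn't need, reduces memory use.
# (The 'en' shortcut works in spaCy 2.x; spaCy 3.x requires a full model name
# such as 'en_core_web_sm'.)
nlp = spacy.load('en', disable=['parser', 'ner'])

# Raise the character limit so spaCy accepts the full joined text
# (the default max_length is 1,000,000 characters).
nlp.max_length = 20000000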

# all the processing work is done below, so it may take a while
dialogs_doc = nlp(" ".join(dialogs_df.dialogs))
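If raising `nlp.max_length` still leads to memory trouble, one alternative is to split the text into smaller pieces and parse them one at a time. The sketch below is only an illustration of that idea, not part of the original solution: `chunk_texts` is a hypothetical helper name and the 35-character budget is chosen purely for the demo.

```python
from typing import Iterable, Iterator, List


def chunk_texts(texts: Iterable[str], max_chars: int) -> Iterator[str]:
    """Yield space-joined chunks of `texts`, each at most `max_chars` long.

    A single text longer than `max_chars` is yielded on its own, unsplit.
    """
    current: List[str] = []
    current_len = 0
    for text in texts:
        # +1 accounts for the joining space when the chunk is non-empty
        added = len(text) + (1 if current else 0)
        if current and current_len + added > max_chars:
            yield " ".join(current)
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        yield " ".join(current)


# Small stand-in for dialogs_df.dialogs (hypothetical demo data)
sample = ["Can we make this quick?", "Forget it.", "Cameron."]
chunks = list(chunk_texts(sample, max_chars=35))
print(chunks)  # ['Can we make this quick? Forget it.', 'Cameron.']
```

In the notebook, the chunks could then be parsed with `nlp.pipe(chunk_texts(dialogs_df.dialogs, 1000000))`, which yields one `Doc` per chunk instead of a single very large `Doc`.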


In [10]:
# let's explore the objects we've built.
print("The dialogs_doc object is a {} object.".format(type(dialogs_doc)))
print("It is {} tokens long".format(len(dialogs_doc)))
print("The first three tokens are '{}'".format(dialogs_doc[:3]))
print("The type of each token is {}".format(type(dialogs_doc[0])))

The dialogs_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 4273815 tokens long
The first three tokens are 'Can we make'
The type of each token is <class 'spacy.tokens.token.Token'>


In [11]:
# removing the stopwords
dialogs_without_stopwords = [token for token in dialogs_doc if not token.is_stop]

In [12]:
# lemmatization
lemmas = [token.lemma_ for token in dialogs_without_stopwords]
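The two steps above keep punctuation and whitespace tokens, because `is_stop` only flags stop words. A possible refinement, sketched here as an assumption rather than as the required solution, also checks the `is_punct` and `is_space` token attributes and lowercases the lemmas. `clean_lemmas` and `fake_token` are hypothetical names, and the demo tokens are hand-built stand-ins with hand-set flags so the block runs without loading a spaCy model.

```python
from types import SimpleNamespace


def clean_lemmas(tokens):
    """Return lowercased lemmas, skipping stop words, punctuation, and whitespace."""
    return [
        tok.lemma_.lower()
        for tok in tokens
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]


def fake_token(text, lemma, stop=False, punct=False, space=False):
    """Build a stand-in object exposing the attributes used above (demo only)."""
    return SimpleNamespace(
        text=text, lemma_=lemma, is_stop=stop, is_punct=punct, is_space=space
    )


# Hand-set flags for illustration; real values come from spaCy.
demo = [
    fake_token("The", "the", stop=True),
    fake_token("dogs", "dog"),
    fake_token("were", "be", stop=True),
    fake_token("barking", "bark"),
    fake_token(".", ".", punct=True),
]
print(clean_lemmas(demo))  # ['dog', 'bark']
```

In the notebook, `clean_lemmas(dialogs_doc)` would apply the same filter to the real tokens.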

In [14]:
# Print only a sample: printing the full list of lemmas exceeds the notebook's
# IOPub data rate limit.
print(lemmas[:50])
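A list with millions of lemmas is easier to inspect through a summary than by printing it in full. The following sketch counts the most frequent lemmas with `collections.Counter`. `summarize` is a hypothetical helper, and `demo_lemmas` is made-up data standing in for the real `lemmas` list.

```python
from collections import Counter


def summarize(lemmas, n=10):
    """Return the n most frequent lemmas with their counts."""
    return Counter(lemmas).most_common(n)


# Made-up stand-in for the real `lemmas` list
demo_lemmas = ["quick", "date", "quick", "french", "date", "quick"]
print(summarize(demo_lemmas, n=2))  # [('quick', 3), ('date', 2)]
```

Calling `summarize(lemmas, n=20)` in the notebook produces a short output that stays well under the data rate limit.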



### Apply the data preprocessing techniques you learned here to the Twitter US Airline Sentiment data. You'll be using this dataset when generating sentences in a later checkpoint.

In [15]:
# The data is in the table called "twitter".
# Apply the data preprocessing techniques you learned here to the Twitter US Airline Sentiment
# data. You'll be using this dataset when generating sentences in a later checkpoint.
# You should access the dataset from the Thinkful database using the following credentials:

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'twitter_sentiment'

In [16]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

tweets_df = pd.read_sql_query('select * from twitter',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


tweets_df.head(10)

Unnamed: 0,index,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)
6,6,570300616901320704,positive,0.6745,,0.0,Virgin America,,cjmcginnis,,0,"@VirginAmerica yes, nearly every time I fly VX...",,2015-02-24 11:13:57 -0800,San Francisco CA,Pacific Time (US & Canada)
7,7,570300248553349120,neutral,0.634,,,Virgin America,,pilot,,0,@VirginAmerica Really missed a prime opportuni...,,2015-02-24 11:12:29 -0800,Los Angeles,Pacific Time (US & Canada)
8,8,570299953286942721,positive,0.6559,,,Virgin America,,dhepburn,,0,"@virginamerica Well, I didn't…but NOW I DO! :-D",,2015-02-24 11:11:19 -0800,San Diego,Pacific Time (US & Canada)
9,9,570295459631263746,positive,1.0,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada)


In [17]:
nlp = spacy.load('en', disable=['parser', 'ner'])

# Raise the character limit so spaCy accepts the full joined text
# (the default max_length is 1,000,000 characters).
nlp.max_length = 20000000

# all the processing work is done below, so it may take a while
twitter_doc = nlp(" ".join(tweets_df.text))

In [18]:
# let's explore the objects we've built.
print("The twitter_doc object is a {} object.".format(type(twitter_doc)))
print("It is {} tokens long".format(len(twitter_doc)))
print("The first three tokens are '{}'".format(twitter_doc[:3]))
print("The type of each token is {}".format(type(twitter_doc[0])))

The twitter_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 307328 tokens long
The first three tokens are '@VirginAmerica What @dhepburn'
The type of each token is <class 'spacy.tokens.token.Token'>
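The first three tokens shown above include two @mentions, which carry little value for sentence generation. One option is to strip mentions and links with the `re` module before the text reaches spaCy. This is a minimal sketch under that assumption: `clean_tweet` is a hypothetical helper, and the patterns are deliberately simple rather than a complete tweet normalizer.

```python
import re

MENTION_RE = re.compile(r"@\w+")          # e.g. @VirginAmerica
URL_RE = re.compile(r"https?://\S+")      # e.g. http://t.co/abc123
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_tweet("@VirginAmerica What @dhepburn said."))  # What said.
```

In the notebook, this could be applied with `tweets_df.text.apply(clean_tweet)` before joining the tweets into one string.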


In [19]:
# removing the stopwords
tweets_without_stopwords = [token for token in twitter_doc if not token.is_stop]

In [20]:
# lemmatization (this reassigns `lemmas`, replacing the list built from the dialogs above)
lemmas = [token.lemma_ for token in tweets_without_stopwords]