# Data Science Programming Languages - DSAI 1303
## Course Project: Sentiment Analysis of Twitter Data

Twitter has emerged as a fundamentally new instrument for obtaining social measurements. For example, researchers have shown that the "mood" of communication on Twitter can be used to predict the stock market.

In this programming project you will:

* Load and prepare a collected set of Twitter data for analysis
* Estimate the sentiment associated with individual tweets
* Estimate the sentiment of a particular term

Please keep in mind the following points:
* This assignment is open-ended in several ways. You will need to make some decisions about how to best solve each of the problems mentioned above.
* **It is absolutely fine to discuss your solutions with your classmates but you are not allowed to share code.**
* **Each student must submit their own solution via Google Classroom.**

## Formatting of Twitter Data

Strings in the Twitter data prefixed with the letter "u" are Unicode strings. For example: `u"This is a string"`. In Python 3 every `str` is already a Unicode string, so the prefix is optional.

Unicode is a standard for representing a much larger variety of characters beyond the Roman alphabet (Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems, etc.).

In most circumstances, you will be able to use a Unicode object just like a string.

If you encounter an error when printing Unicode text, you can use the [encode](https://docs.python.org/3/library/stdtypes.html#str.encode) method to print the international characters properly. You can find more information about Unicode and Python 3 [here](https://docs.python.org/3/howto/unicode.html).
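As a minimal sketch of the `encode` approach mentioned above (the sample string is made up for illustration), a Python 3 `str` can be converted to bytes with an explicit encoding and error policy:

```python
# Hypothetical sample text containing non-ASCII characters ("café" plus two Greek letters).
text = "caf\u00e9 \u03b1\u03b2"

# UTF-8 can represent every Unicode character, so this round-trips losslessly.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ASCII cannot represent these characters; errors="replace" substitutes "?".
ascii_bytes = text.encode("ascii", errors="replace")

print(round_trip == text)
print(ascii_bytes)
```

Whether you need this at all depends on your terminal; most modern consoles print Unicode text directly.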

# Question 1: Loading and Cleaning Twitter Data [20 points]

In this first part, you will need to load a sample of tweets in memory and prepare them for analysis. The tweets are stored in the file `tweets.json`. This file follows the *JSON* format. JSON stands for JavaScript Object Notation. It is a simple format for representing nested structures of data: lists of lists of dictionaries of lists of ... you get the idea.

Each line of `tweets.json` represents a message. It is straightforward to convert a JSON string into a Python data structure; there is a library called `json` that does this. Below we will show you how to load the data and how to parse the first line in the `tweets.json` file.
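A minimal sketch of parsing one such line with the `json` library follows. The sample line is hypothetical and much shorter than a real entry in `tweets.json`:

```python
import json

# Hypothetical line from tweets.json; real tweets carry many more fields.
line = '{"created_at": "Mon Jan 01 12:00:00 +0000 2018", "text": "At the dentist", "entities": {"hashtags": []}}'

tweet = json.loads(line)  # parse one JSON string into a Python dict

print(type(tweet).__name__)
print(tweet["text"])
```

To process the whole file, the same call can be applied to every line read from an open file handle, which keeps the tweets in their original order.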

In [177]:
import pandas as pd 
tweets = pd.read_csv(r"C:\Users\best tech\Desktop\مشروع مادة البرمجة الفصل الثاني\course project programming - الورقة1.csv", sep=",", names=["text"])
tweets.head(20)

,text
0,@Zu_Noma You fantasize about dating a dentist?
1,At the dentist
2,will you stop raining let me go to my dentist
3,"If you are naughty, I shall bite you. And the ..."
4,@mmpadellan This is the adult take.
5,Dentists do have emergencies.
6,The Dentist Offers Treatments Options Be sure ...
7,@faezmaleksss visit yr dentist every 6 months ...
8,Dentist Lawrence Neville reveals foods to avoi...
9,@TotalRaritrash Will never see here anything l...


In [178]:
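import re

def clean_tweet(x):
    """Clean one tweet: drop @mentions and URLs, trim punctuation, lowercase."""
    e1 = re.sub(r"@\S*"," ",x)  # remove @mentions
    # remove URLs
    e2 = re.sub(r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"," ",e1)
    word = e2.strip(".,;!%&'#$?")  # trim punctuation at both ends of the tweet
    return word.lower()

text_tweets = tweets["text"].apply(clean_tweet)
# Write one clean tweet per line; str(Series) would write a truncated summary instead.
with open("clean.txt", "w", encoding="utf-8") as clean_datafile:
    clean_datafile.write("\n".join(text_tweets) + "\n")
text_tweets.head(20)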

0                  you fantasize about dating a dentist
1                                       at the dentist 
2         will you stop raining let me go to my dentist
3     if you are naughty, i shall bite you. and the ...
4                                this is the adult take
5                          dentists do have emergencies
6     the dentist offers treatments options be sure ...
7           visit yr dentist every 6 months if you can 
8     dentist lawrence neville reveals foods to avoi...
9       will never see here anything less then the f...
10    i cant be the only dentist who couldnt care le...
11          first dentist to become a chief minister.  
12    they do the same with mexico. americans are us...
13    went to dentist today and his face was so funn...
14    feeling very dentist stuffies w teeth horrorco...
15    omgg that’s so cool if u became a dentist!!! i...
16                    sadly... long line at the dentist
17            dentists will love jake's teeth me

Each entry in `tweets.json`, i.e., each `tweet`, corresponds to a dictionary that contains lots of information about the tweet, the user, the activity related to the tweet (i.e., if it was retweeted or not), the timestamp of the tweet, entities mentioned in the tweet, hashtags used, etc.

You can treat the `tweet` variable from above as a dictionary and use the `.keys()` method to see the fields associated with it.

We can select any of the aforementioned values of the variable `tweet` by treating it as a dictionary. For example, let's select the `text` body of the tweet, the time it was `created_at`, and the `hashtags` it contains.
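The sketch below illustrates this kind of field access on a hypothetical tweet dictionary. It assumes the hashtags sit under an `entities` key, as in Twitter's classic JSON layout; check `.keys()` on your own data to confirm.

```python
# Hypothetical tweet dictionary, shaped like one parsed entry of tweets.json.
tweet = {
    "created_at": "Mon Jan 01 12:00:00 +0000 2018",
    "text": "@someuser Great visit, see https://example.com",
    "entities": {"hashtags": [], "urls": [], "user_mentions": []},
}

print(sorted(tweet.keys()))           # fields available on this tweet
print(tweet["text"])                  # body of the tweet
print(tweet["created_at"])            # timestamp string
print(tweet["entities"]["hashtags"])  # an empty list means no hashtags
```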

As you can see, this tweet contains no hashtags. The body of the tweet contains several pieces of information that are not necessary for our sentiment analysis task. For example, it contains a comma, a reference to a Twitter user and a link to an external website.

Since this information is not necessary we can remove it. In other words we need to clean our input in order to prepare it for analysis. Next, we show you some basic cleaning operations using **regular expressions**. You can find more information on regular expressions [here](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285).
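As one possible sketch of such cleaning (the function name `basic_clean` and the exact patterns are illustrative assumptions, not the required solution), a few `re.sub` calls can strip mentions, links and punctuation:

```python
import re

def basic_clean(text):
    """Illustrative cleaner: drop @mentions, URLs and punctuation, then lowercase."""
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[^\w\s]", " ", text)       # remaining punctuation
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip().lower()

print(basic_clean("@someuser Great visit, see https://example.com"))
# prints: great visit see
```

Note that the punctuation pattern also splits contractions such as "they're" into two tokens; whether that is acceptable is one of the open design decisions in this assignment.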

In [99]:
word2 = []
for line in text_tweets:
    for word in line.split():
        # normalize each token: lowercase and trim surrounding punctuation
        word = word.lower().strip("’.,;%$&#!'?")
        if word.isdigit():
            continue  # skip purely numeric tokens
        word2.append(word)
        print(word)

you
fantasize
about
dating
a
dentist
at
the
dentist
will
you
stop
raining
let
me
go
to
my
dentist
if
you
are
naughty
i
shall
bite
you
and
the
dentist
tells
me
i
have
one
of
the
strongest
bites
he
knows
so
beware
this
is
the
adult
take
dentists
do
have
emergencies
the
dentist
offers
treatments
options
be
sure
that
the
dentist
provides
treatment
options
that
fit
your
specific
needs
and
your
budget
coast
dental
gives
every
patient
a
written
treatment
plan
to
review
before
any
procedure
begins
visit
yr
dentist
every
months
if
you
can
dentist
lawrence
neville
reveals
foods
to
avoid
if
you
want
a
perfect
smile
will
never
see
here
anything
less
then
the
fandoms
dentist
pony
i
cant
be
the
only
dentist
who
couldnt
care
less
about
orthodontics
first
dentist
to
become
a
chief
minister
they
do
the
same
with
mexico
americans
are
usually
complaining
about
mexico
but
they're
constantly
there
at
the
dentist
and
getting
prescriptions
went
to
dentist
today
and
his
face
was
so
funny
i
started
cracking
up

We are providing you with a Python script named `preprocess.py`. The script `preprocess.py` accepts one argument on the command line: a JSON file with tweets (i.e., `tweets.json`). You can run the program like this:

`$ python3 preprocess.py tweets.json`

**There are some parts specified in this script that you need to implement**. The goal of this script is to clean all the tweets in `tweets.json`. Running `preprocess.py` will generate an output file named `clean_tweets.txt` containing **one string per line**, each a clean tweet. The order of the clean tweets in your output file should follow the order of the lines in the original `tweets.json`: the first line in `clean_tweets.txt` should correspond to the first raw tweet in `tweets.json`, the second line to the second tweet, and so on. If you sort the data or put the processed data in a dictionary, the order may not be preserved. Once again: **The n-th line of `clean_tweets.txt` (the file you will submit) should be a string that represents the clean version of the n-th line in `tweets.json` (the input file).**

You must provide a line for **every** tweet. If the clean tweet is the empty string then just provide a line with the empty string.
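The ordering and empty-line requirements above can be sketched as follows. The helper `write_clean_tweets` and the stand-in cleaning function are hypothetical; an in-memory buffer replaces the real output file so the example runs on its own.

```python
import io

def write_clean_tweets(raw_texts, clean, out):
    """Write clean(text) for every input text, one per line, in input order."""
    for text in raw_texts:
        cleaned = clean(text).replace("\n", " ")  # keep each tweet on one line
        out.write(cleaned + "\n")                 # an empty tweet still gets a line

# Hypothetical input and a stand-in cleaning function.
raw = ["At the Dentist", "!!!", "Dentists do have emergencies."]
buffer = io.StringIO()
write_clean_tweets(raw, lambda t: t.strip(".!").lower(), buffer)

lines = buffer.getvalue().split("\n")[:-1]  # drop the piece after the final newline
print(lines)
```

Iterating over a list and writing immediately avoids any intermediate sorting or deduplication, so line n of the output always matches tweet n of the input.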

***What to turn in: The file `clean_tweets.txt` output by `preprocess.py` after you have implemented the missing parts in `preprocess.py`.***

# Question 2: Derive the sentiment of each tweet [40 points]

For this part, you will compute the sentiment of each clean tweet in `clean_tweets.txt` based on the sentiment scores of the terms in the tweet. The sentiment of a tweet is equivalent to the sum of the sentiment scores for each term in the clean tweet.
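A minimal sketch of this sum is shown below. The small `scores` dictionary is a stand-in for the full sentiment file, and the sketch treats each whitespace-separated token as one term, so it ignores multi-word phrases.

```python
# Small stand-in for the sentiment dictionary; the real values come from the sentiment file.
scores = {"perfect": 3, "smile": 2, "avoid": -1, "want": 1}

def tweet_sentiment(clean_tweet, scores):
    """Sum the scores of the whitespace-separated terms; unknown terms count as 0."""
    return sum(scores.get(term, 0) for term in clean_tweet.split())

print(tweet_sentiment("foods to avoid if you want a perfect smile", scores))
# prints: 5  (-1 + 1 + 3 + 2)
```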

You are provided with a skeleton file `tweet_sentiment.py` which accepts two arguments on the command line: a *sentiment file* and a tweet file like the one you generated in Question 1. You can run the skeleton program like this:

`$ python3 tweet_sentiment.py AFINN-111.txt clean_tweets.txt`

The file `AFINN-111.txt` contains a list of pre-computed sentiment scores. Each line in the file contains a word or phrase followed by a sentiment score. Each word or phrase that is found in a tweet but not found in `AFINN-111.txt` should be given a sentiment score of 0. See the file `AFINN-README.txt` for more information.

To use the data in the `AFINN-111.txt` file, you may find it useful to build a dictionary. Note that the `AFINN-111.txt` file format is tab-delimited, meaning that the term and the score are separated by a tab character. A tab character corresponds to the string "\t". The following snippet of code may be useful:

In [100]:
scores = {} # initialize an empty dictionary
# The with-block closes the file even if parsing fails.
with open("AFINN-111.txt", "r") as afinnfile:
    for line in afinnfile:
        term, score = line.split("\t") # The file is tab-delimited and "\t" means tab character
        scores[term] = int(score) # Convert the score to an integer. It was parsed as a string.

In [101]:
for word in word2:
    # Terms missing from AFINN-111 get a score of 0.
    print(word, scores.get(word, 0))

you 0
fantasize 0
about 0
dating 0
a 0
dentist 0
at 0
the 0
dentist 0
will 0
you 0
stop -1
raining 0
let 0
me 0
go 0
to 0
my 0
dentist 0
if 0
you 0
are 0
naughty 0
i 0
shall 0
bite 0
you 0
and 0
the 0
dentist 0
tells 0
me 0
i 0
have 0
one 0
of 0
the 0
strongest 2
bites 0
he 0
knows 0
so 0
beware 0
this 0
is 0
the 0
adult 0
take 0
dentists 0
do 0
have 0
emergencies 0
the 0
dentist 0
offers 0
treatments 0
options 0
be 0
sure 0
that 0
the 0
dentist 0
provides 0
treatment 0
options 0
that 0
fit 1
your 0
specific 0
needs 0
and 0
your 0
budget 0
coast 0
dental 0
gives 0
every 0
patient 0
a 0
written 0
treatment 0
plan 0
to 0
review 0
before 0
any 0
procedure 0
begins 0
visit 0
yr 0
dentist 0
every 0
months 0
if 0
you 0
can 0
dentist 0
lawrence 0
neville 0
reveals 0
foods 0
to 0
avoid -1
if 0
you 0
want 1
a 0
perfect 3
smile 2
will 0
never 0
see 0
here 0
anything 0
less 0
then 0
the 0
fandoms 0
dentist 0
pony 0
i 0
cant 0
be 0
the 0
only 0
dentist 0
who 0
couldnt 0
care 2
less 0
about 0
ortho

In [102]:
# Sum of sentiment scores per tweet
sum_tweets = {}       # clean tweet text -> total sentiment score
list_notin_doc = []   # tokens that have no entry in AFINN-111
with open("sum_tweets.txt", "w") as sentimentfile:
    for key in text_tweets:
        summ = 0
        for text in key.split():
            if text in scores:
                summ += scores[text]
            else:
                list_notin_doc.append(text)  # unknown terms contribute 0
        sum_tweets[key] = summ
        print(key, "   ", summ)
        sentimentfile.write("%s\n" % summ)  # one score per line, in tweet order

  you fantasize about dating a dentist     0
at the dentist      0
will you stop raining let me go to my dentist     -1
if you are naughty, i shall bite you. and the dentist tells me i have one of the strongest bites he knows. so beware     2
  this is the adult take     0
dentists do have emergencies     0
the dentist offers treatments options be sure that the dentist provides treatment options that fit your specific needs and your budget. coast dental gives every patient a written treatment plan to review before any procedure begins.       1
  visit yr dentist every 6 months if you can      0
dentist lawrence neville reveals foods to avoid if you want a perfect smile       5
  will never see here anything less then the fandoms dentist pony       0
i cant be the only dentist who couldnt care less about orthodontics     2
first dentist to become a chief minister.       0
they do the same with mexico. americans are usually complaining about mexico but they're constantly there at the den

Your script should output a file named `sentiment.txt` containing the sentiment of each tweet in the file `clean_tweets.txt`, one numeric sentiment score per line. The first score should correspond to the first tweet, the second score should correspond to the second tweet, and so on. In other words, **the n-th line of the file you submit should contain only a single number that represents the score of the n-th tweet in the input file.**

After you have implemented everything, the first 10 lines of the output generated by your script should be exactly the same as the following lines:

```
0
0
0
0
0
1
2
-4
0
0
```

***What to turn in: The file `sentiment.txt` after you have verified that it returns the correct answers***

# Question 3: Derive the sentiment of new terms [40 points]

In this part you will create a script that computes the sentiment for terms that **do not** appear in the file `AFINN-111.txt`.

You can think about this problem as follows: We know we can use the sentiment-carrying words in `AFINN-111.txt` to deduce the overall sentiment of a tweet. Once you deduce the sentiment of a tweet, you can work backwards to deduce the sentiment of the words that *do not appear* in `AFINN-111.txt`. For example, if the word *football* always appears in proximity with positive words like *great* and *fun*, then we can deduce that the term *football* itself carries a positive sentiment.

You are provided with a skeleton file `term_sentiment.py` which accepts the same two arguments as `tweet_sentiment.py` and can be executed using the following command:

`$ python3 term_sentiment.py AFINN-111.txt clean_tweets.txt`

Your script should print its output to stdout. Each line of the output should contain a term, followed by a space, followed by a sentiment. That is, each line should be in the format `<term:string> <sentiment:float>`. For example, if you have the pair `("foo", 54.2)` in Python, it should appear in the output as: `foo 54.2`.

*The order of your output does not matter.*
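One possible heuristic, sketched below, scores each unknown term by the average sentiment of the tweets it appears in. Averaging is an assumption made for illustration (the assignment leaves the method open), and the function name, scores and tweets are hypothetical.

```python
from collections import defaultdict

def term_sentiments(clean_tweets, scores):
    """Average tweet sentiment over the tweets each unknown term appears in."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tweet in clean_tweets:
        terms = tweet.split()
        tweet_score = sum(scores.get(t, 0) for t in terms)
        for term in set(terms):      # count each term once per tweet
            if term not in scores:   # only terms without a known score
                totals[term] += tweet_score
                counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}

# Hypothetical scores and tweets for illustration.
scores = {"great": 3, "fun": 4, "boring": -3}
tweets = ["football is great", "football is fun", "homework is boring"]
result = term_sentiments(tweets, scores)
for term, sentiment in result.items():
    print(term, sentiment)  # "<term> <sentiment>", one pair per line
```

Here *football* ends up with 3.5 because it only co-occurs with positive words, while a neutral word such as *is* lands closer to zero because it appears in both positive and negative tweets.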

***What to turn in: The file `term_sentiment.py` after you have implemented the missing parts.***


In [117]:
lisst = ["you","about","a","at","the","will","me","to","if","are","i","and","have","has","he",
         "so","this","is","do","be","that","your","any","yr","who","where","with","but","at",
         "his","here","up","as","why","am","in","w","rn","i’m","also","when","was","were","is","your","omg","yup","you!","woah!!",
         "8","bro","a1","4x4","for","out","...was","sadly...","…and","&lt;3","you.","then","they","they're","before",
        "don’t","wanna","much,","much","self-important","whole","never","these","next","days","a18","had","up","all","that’s","jake's",
        "year","omgg","y","first","up,","day","between","over","very","one","two","dentists","dentist","your","so","#mercedes","'sadly,","dentist!!!",'"domino']

In [118]:
# Build a new list instead of calling remove() while iterating, which skips elements.
list_notin_doc = [n for n in list_notin_doc if len(n) > 2 and n not in lisst]
print(list_notin_doc)

['fantasize', 'dating', 'raining', 'let', 'naughty,', 'shall', 'bite', 'tells', 'bites', 'knows.', 'beware', 'adult', 'take', 'emergencies', 'offers', 'treatments', 'options', 'sure', 'provides', 'treatment', 'options', 'specific', 'needs', 'budget.', 'coast', 'dental', 'gives', 'every', 'patient', 'written', 'treatment', 'plan', 'review', 'procedure', 'begins.', 'visit', 'every', 'months', 'can', 'lawrence', 'neville', 'reveals', 'foods', 'see', 'anything', 'less', 'fandoms', 'pony', 'cant', 'only', 'couldnt', 'less', 'orthodontics', 'become', 'chief', 'minister.', 'same', 'mexico.', 'americans', 'usually', 'complaining', 'mexico', 'constantly', 'there', 'getting', 'prescriptions', 'went', 'today', 'face', 'started', 'cracking', 'really', 'make', 'story', 'really', 'gotta', 'serious', 'situation', 'stuffies', 'teeth', 'horrorcore', 'became', 'rooting', 'study', 'fashion', 'long', 'line', 'teeth', 'thinks', 'scene', 'anne', 'alpha', 'omega"', 'foreshadowing', 'literal', 'takinythe', 'f

In [119]:
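listt1 = []
unknown_terms = set(list_notin_doc)  # set membership tests are fast
for key, total in sum_tweets.items():
    for y in key.split():
        if y in unknown_terms:
            listt1.append([y, total])  # pair each unknown term with its tweet's score
print(listt1)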


[['fantasize', 0], ['dating', 0], ['raining', -1], ['let', -1], ['naughty,', 2], ['shall', 2], ['bite', 2], ['tells', 2], ['bites', 2], ['knows.', 2], ['beware', 2], ['adult', 0], ['take', 0], ['emergencies', 0], ['offers', 1], ['treatments', 1], ['options', 1], ['sure', 1], ['provides', 1], ['treatment', 1], ['options', 1], ['specific', 1], ['needs', 1], ['budget.', 1], ['coast', 1], ['dental', 1], ['gives', 1], ['every', 1], ['patient', 1], ['written', 1], ['treatment', 1], ['plan', 1], ['review', 1], ['procedure', 1], ['begins.', 1], ['visit', 0], ['every', 0], ['months', 0], ['can', 0], ['lawrence', 5], ['neville', 5], ['reveals', 5], ['foods', 5], ['see', 0], ['anything', 0], ['less', 0], ['fandoms', 0], ['pony', 0], ['cant', 2], ['only', 2], ['couldnt', 2], ['less', 2], ['orthodontics', 2], ['become', 0], ['chief', 0], ['minister.', 0], ['same', 0], ['mexico.', 0], ['americans', 0], ['usually', 0], ['complaining', 0], ['mexico', 0], ['constantly', 0], ['there', 0], ['getting', 0]

In [144]:
df = pd.DataFrame((listt1),columns=("word","value of tweets"))
df.set_index("word").head(20)

word,value of tweets
fantasize,0
dating,0
raining,-1
let,-1
"naughty,",2
shall,2
bite,2
tells,2
bites,2
knows.,2


In [169]:
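import numpy as np

# Vote +1 for each non-negative tweet a word appears in and -1 for each negative one.
df["prediction"] = np.where(df["value of tweets"] >= 0,1,-1)
df2 = df.groupby("word").sum()  # per-word totals of tweet scores and votes
# To save the predictions: df2["prediction"].to_csv("word.txt", sep=" ", header=False)
df2.head()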

word,value of tweets,prediction
adult,0,1
against,3,1
alpha,3,1
alternative,0,1
americans,0,1
