# Sentiment analysis from Twitter and Content Filtering Recommendation System

In [5]:
import numpy as np
import pandas as pd

In [6]:
import nltk
#nltk.download('wordnet')

## Sentiment analysis 

For this part we use TextBlob: a pretrained sentiment analyzer is far more convenient than training our own, and it also provides good results.

In [1]:
import textblob

In [6]:
pip install textblob
pip install nltk

Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.17.1
Note: you may need to restart the kernel to use updated packages.


In [15]:
from textblob import TextBlob

def get_tweet_sentiment(tweet):
    """
    Utility function to classify the sentiment of the passed tweet
    using TextBlob's sentiment method
    """
    # polarity ranges from -1 (negative) to 1 (positive)
    polarity = TextBlob(tweet).sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'
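
The function above labels a tweet as neutral only when the polarity is exactly 0, so a barely positive score already counts as positive. A minimal sketch of a variant with a neutral band follows. The name `label_polarity` and the 0.1 threshold are assumptions made for illustration and are not part of TextBlob. The function takes the polarity value itself (for example `TextBlob(tweet).sentiment.polarity`) instead of the tweet, so it can be tried without installing anything.

```python
def label_polarity(polarity, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a sentiment label.

    `neutral_band` is a hypothetical tuning parameter: scores whose
    absolute value is at or below it are treated as neutral.
    """
    if polarity > neutral_band:
        return 'positive'
    if polarity < -neutral_band:
        return 'negative'
    return 'neutral'


for score in (0.8, 0.05, -0.4):
    print(score, label_polarity(score))
```

Setting `neutral_band=0.0` reproduces the behaviour of `get_tweet_sentiment`.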

In [108]:
import re  # regular expressions

def get_movie_from_tweet(tweet):
    """Utility function to get the titles of films mentioned in a tweet.
    A title must be enclosed in double quotes for the function to find it;
    the result is a list of all quoted strings (empty if there are none)."""
    return re.findall(r'"([^"]*)"', tweet)

In [21]:
s = 'I really liked "sompoy"'
print(get_movie_from_tweet(s))

['sompoy']
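
The pattern above only recognizes straight double quotes. Tweets written on phones often contain typographic quotes instead, so the sketch below accepts both kinds. The helper name `extract_quoted_titles` and the choice of supported quote characters are assumptions for illustration. It also lowercases and strips each match, and it returns an empty list when nothing is quoted.

```python
import re

# Straight double quotes plus the curly pair “ ” (an assumption about
# which quote characters are worth supporting).
QUOTED = re.compile(r'["“]([^"“”]+)["”]')


def extract_quoted_titles(tweet):
    """Return every non-empty quoted string in the tweet, lowercased and stripped."""
    return [m.strip().lower() for m in QUOTED.findall(tweet) if m.strip()]


print(extract_quoted_titles('I really liked "Sompoy"'))
print(extract_quoted_titles('Just saw “Dune” and "Cash"'))
print(extract_quoted_titles('no title here'))
```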


In [18]:
s = 'Dune was so dope'
print(get_tweet_sentiment(s))

negative


## Recommendation System

## Modeling

This recommendation system uses an item-based approach. Items usually don’t change much, so item-to-item similarities can often be computed offline and served without constant retraining. A user-based approach, by contrast, is far more volatile, because users and their tastes keep changing.

## Loading Dataset

The dataset selected is MovieTweetings (https://github.com/sidooms/MovieTweetings), a dataset of movie ratings extracted from well-structured tweets on Twitter. I used this dataset because it was more convenient than extracting well-structured tweets from the Twitter dataframe, mainly because users don't usually follow a common template when writing their opinions about movies (such as putting the movie title in quotation marks). For the record, any movie dataset could be used, but the simplicity of this one made it the most suitable for this paper.

In [31]:
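colnames = ['movie_id', 'movie_title', 'genre']
# A multi-character separator such as '::' requires the Python parsing engine;
# passing engine='python' explicitly avoids the ParserWarning pandas emits otherwise.
# Note: header=0 together with names= treats the first line of the file as a
# header row and replaces it, so that line is not loaded as data.
movies_df = pd.read_csv('MovieTweetings-master/latest/movies.dat', header=0, encoding='utf-8',
                        names=colnames, sep='::', engine='python', dtype={
    'movie_id': int,
    'movie_title': str,
    'genre': str})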


### Data processing

In the following cells we apply some basic operations: we convert the text columns from object to string dtype and replace NaN values with "unknown".

In [53]:
movies_df['movie_title'] = movies_df['movie_title'].astype('string')
movies_df['genre'] = movies_df['genre'].astype('string')

In [60]:
movies_df = movies_df.fillna("unknown")

In [61]:
movies_df

Unnamed: 0,movie_id,movie_title,genre
0,10,La sortie des usines Lumière (1895),Documentary|Short
1,12,The Arrival of a Train (1896),Documentary|Short
2,25,The Oxford and Cambridge University Boat Race ...,unknown
3,91,Le manoir du diable (1896),Short|Horror
4,131,Une nuit terrible (1896),Short|Comedy|Horror
...,...,...,...
38012,15711402,Les rois de l'arnaque (2021),Crime|Documentary
38013,15831978,Cash (2021),unknown
38014,15839820,Sompoy (2021),Comedy|Romance
38015,15842076,The Making of 'Rocky vs. Drago' (2021),Documentary


We also need to tidy the genre and movie_title columns to make them easier to work with: both are lowercased, the "|" separators in genre are replaced by spaces, and the parenthesized year is dropped from each title. We use NLTK to process the genre text, specifically lemmatization, which is the process of converting a word to its base (dictionary) form.

In [144]:
from nltk.stem import wordnet

lemmatizer = wordnet.WordNetLemmatizer()
genres = movies_df["genre"]
li=[]

for i in range(len(genres)):
    temp = genres[i].lower()
    temp = temp.split("|")
    temp = [lemmatizer.lemmatize(word) for word in temp]
    li.append(" ".join(temp))

In [63]:
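titles = movies_df["movie_title"]
lj=[]

for i in range(len(titles)):
    # Keep only the text before the first "(", which drops the "(year)" suffix.
    # The space before "(" is kept, so every title ends with a trailing space ("sompoy ").
    temp = titles[i].split("(")
    temp = temp[0].lower()
    lj.append(temp)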
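
Splitting on the first "(" works for titles whose only parenthesis is the year, but a title that itself starts with a parenthesis, such as "(500) Days of Summer (2009)", would be reduced to an empty string. The sketch below removes only a trailing four-digit year. The helper name `clean_title` is hypothetical, and it also strips the trailing space, so the lookup code later in this notebook (which appends a space to the title) would need to be adjusted if this variant were adopted.

```python
import re

# Matches a trailing release year such as " (2021)" at the end of the title.
TRAILING_YEAR = re.compile(r'\s*\(\d{4}\)\s*$')


def clean_title(raw_title):
    """Drop only the trailing "(year)" and normalize case and whitespace."""
    return TRAILING_YEAR.sub('', raw_title).strip().lower()


print(clean_title('Sompoy (2021)'))
print(clean_title('(500) Days of Summer (2009)'))
print(repr('(500) Days of Summer (2009)'.split('(')[0]))  # split-based approach yields ''
```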

In [64]:
movies_df['movie_title'] = lj

In [65]:
movies_df['genre'] = li

In [68]:
movies_df['movie_title'] = movies_df['movie_title'].astype('string')
movies_df['genre'] = movies_df['genre'].astype('string')

In [69]:
movies_df.dtypes

movie_id        int32
movie_title    string
genre          string
dtype: object

## Creating the Model

The first step is to tokenize the genre column, creating for each movie a vector that represents the genres under which it is catalogued. For this purpose we use CountVectorizer, which takes a collection of text documents and converts it to a matrix of token counts. Since we didn't supply an a priori dictionary or an analyzer that performs some kind of feature selection,
the number of features will be equal to the vocabulary size found by analyzing the data.
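
To make the token-count idea concrete, here is a minimal pure-Python sketch of what such a vectorizer does. `DEFAULT_PATTERN` is the default `token_pattern` of scikit-learn's CountVectorizer; the helper `count_matrix`, the toy genre strings and `WHITESPACE_PATTERN` are assumptions for illustration. The default pattern splits on hyphens, which is presumably why hyphenated genres such as "sci-fi" end up as two separate features ("sci" and "fi") in the vocabulary.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: treat every whitespace-separated chunk as one token.
WHITESPACE_PATTERN = r"\S+"


def count_matrix(documents, pattern):
    """Build a sorted vocabulary and a matrix of token counts, one row per document."""
    tokenized = [re.findall(pattern, doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, rows


genres = ["documentary short", "sci-fi thriller", "comedy romance"]

vocab, rows = count_matrix(genres, DEFAULT_PATTERN)
print(vocab)     # 'sci-fi' is split into 'sci' and 'fi'
print(rows)

vocab_ws, rows_ws = count_matrix(genres, WHITESPACE_PATTERN)
print(vocab_ws)  # 'sci-fi' stays a single feature
```

If keeping hyphenated genres together matters, CountVectorizer accepts a custom `token_pattern` argument; whether that improves the recommendations here is untested.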

In [70]:
#Finding based on similar movies
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(movies_df["genre"]).toarray()
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]], dtype=int64)

In [23]:
print("Count Vector : \n",X)
print("\nNote: First row of above count vector: ",X[0])
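# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print("\nColumns Coresponding to above count vector is :\n",cv.get_feature_names())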

Count Vector : 
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]]

Note: First row of above count vector:  [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]

Columns Coresponding to above count vector is :
 ['action', 'adult', 'adventure', 'animation', 'biography', 'comedy', 'crime', 'documentary', 'drama', 'family', 'fantasy', 'fi', 'film', 'game', 'history', 'horror', 'music', 'musical', 'mystery', 'news', 'noir', 'reality', 'romance', 'sci', 'short', 'show', 'sport', 'talk', 'thriller', 'tv', 'unknown', 'war', 'western']


In [71]:
output = movies_df.loc[:,['movie_id','movie_title']]
output = output.join(pd.DataFrame(X))
output

Unnamed: 0,movie_id,movie_title,0,1,2,3,4,5,6,7,...,23,24,25,26,27,28,29,30,31,32
0,10,la sortie des usines lumière,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
1,12,the arrival of a train,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
2,25,the oxford and cambridge university boat race,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,91,le manoir du diable,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,131,une nuit terrible,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38012,15711402,les rois de l'arnaque,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
38013,15831978,cash,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
38014,15839820,sompoy,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
38015,15842076,the making of 'rocky vs. drago',0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### Cosine similarity

Cosine similarity is a measure of similarity between two sequences of numbers. To define it, the sequences are viewed as vectors in an inner product space, and the cosine similarity is the cosine of the angle between them, that is, the dot product of the vectors divided by the product of their lengths. For the non-negative count vectors used here, the value lies between 0 (no genre in common) and 1 (identical genre profile).
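
As a worked example of the definition, the sketch below computes the cosine similarity by hand for a few toy genre vectors. The function name `cosine`, the three-column layout and the zero-vector convention are assumptions for illustration. A movie tagged "documentary short" compared with one tagged only "documentary" gives 1/√2 ≈ 0.7071, a value that also appears in the printed similarity matrix.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy genre vectors over the columns [comedy, documentary, short]
documentary_short = [0, 1, 1]
documentary_only = [0, 1, 0]
comedy_only = [1, 0, 0]

print(cosine(documentary_short, documentary_short))  # identical genres, approximately 1
print(cosine(documentary_short, documentary_only))   # one shared genre, approximately 0.7071
print(cosine(documentary_short, comedy_only))        # nothing in common, 0
```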

In [72]:
# Each row of X corresponds to one movie
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(X)
# Each row of the matrix holds the similarity of one movie to every movie,
# so the row length equals the number of movies, len(movies_df)

In [26]:
print(similarities)

[[1.         1.         0.         ... 0.         0.70710678 0.        ]
 [1.         1.         0.         ... 0.         0.70710678 0.        ]
 [0.         0.         1.         ... 0.         0.         1.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.70710678 0.70710678 0.         ... 0.         1.         0.        ]
 [0.         0.         1.         ... 0.         0.         1.        ]]
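
The full matrix stores one value per pair of movies, which for roughly 38,000 movies is on the order of 1.4 billion entries. When only the neighbours of a single movie are needed, one row can be computed on demand instead. The sketch below does this in pure Python on toy data; the function name `top_k_similar` and the toy vectors are assumptions for illustration, and it is not meant as a drop-in replacement for the scikit-learn call.

```python
import heapq
import math


def top_k_similar(vectors, query_index, k):
    """Return the k (index, similarity) pairs most similar to vectors[query_index]."""
    query = vectors[query_index]
    query_norm = math.sqrt(sum(a * a for a in query))
    scored = []
    for idx, vec in enumerate(vectors):
        if idx == query_index:
            continue  # never recommend the movie itself
        norm = math.sqrt(sum(a * a for a in vec))
        dot = sum(a * b for a, b in zip(query, vec))
        sim = dot / (query_norm * norm) if query_norm and norm else 0.0
        scored.append((idx, sim))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy genre vectors over the columns [comedy, documentary, romance, short]
toy = [
    [0, 1, 0, 1],  # 0: documentary short
    [1, 0, 1, 0],  # 1: comedy romance
    [1, 0, 0, 0],  # 2: comedy
    [0, 1, 0, 0],  # 3: documentary
]
print(top_k_similar(toy, query_index=1, k=2))
```

With only genre counts as features, many movies tie at the same similarity, so the order among ties is arbitrary.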


## Application

In [104]:
print('sompoy ' in output['movie_title'].values)

True


In [138]:
# Lookup helpers over the `output` dataframe (titles are lowercase with a trailing space)
def check_movie_df(movie):
    return movie in output['movie_title'].values
def get_movie_id(movie):
    return output.loc[output['movie_title'] == movie].movie_id
def get_movie_index(movie):
    return output.loc[output['movie_title'] == movie].index
def get_movie_by_index(idx):
    return output.movie_title[idx]
def get_movie_by_id(mv_id):
    return output.loc[output['movie_id']==mv_id,['movie_title']].values[0][0]

In [127]:
print(get_movie_by_id(15831978))

cash 


In [143]:
tweet = input("Write an opinion about a movie: ")
titles_found = get_movie_from_tweet(tweet)
# Titles in `output` keep a trailing space, so append one before the lookup.
movie_title = (titles_found[0].lower() + " ") if titles_found else None
if movie_title is not None and check_movie_df(movie_title):
    sentiment = get_tweet_sentiment(tweet)
    if sentiment == 'positive':
        # Take the first match in case several movies share the same title.
        movie_index = get_movie_index(movie_title)[0]
        movie_id = get_movie_id(movie_title).values[0]
        # Rank all movies by similarity to the chosen one, most similar first.
        similarity_values = pd.Series(similarities[movie_index])
        similar_movie_indexes = list(similarity_values.sort_values(ascending=False).index)
        similar_movie_indexes.remove(movie_index)

        print("Since u watched --->",get_movie_by_id(movie_id),"<--- We recommend you")
        for i in range(15):
            print(get_movie_by_index(similar_movie_indexes[i]))
    else:
        print("I will not recommend similar movies")
else:
    print("The movie is not in the system")

Write an opinion about a movie: i liked "sompoy"
Since u watched ---> sompoy  <--- We recommend you
101 reykjavík 
loser 
miss partners 
celal ile ceren 
let my people go! 
reinas 
just friends 
buying the cow 
keinohrhasen 
the kissing booth 2 
the food guide to love 
she remembers, he forgets 
chacun cherche son chat 
we were dancing 
left right and centre 
