# Natural Language Processing and Recommender systems

## 1. Explain natural language processing in your own words

Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the 
interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

## 2. discuss what is word embedding, lemmatization, stemming

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

Stemming and lemmatization are methods used to analyze the meaning behind a word in a computer world
Stemming uses the stem of the word, while lemmatization uses the context in which the word is being used.

## 3. What is TF-IDF?

TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and 
machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc)  in a document amongst
a collection of documents (also known as a corpus).

## 4. What do you mean by recommender systems?

A recommender system, or a recommendation system (sometimes replacing 'system' with a synonym such as platform or engine), is a subclass of information 
filtering system that seeks to predict the "rating" or "preference" a user would give to an item.[1][2]

Recommender systems are used in a variety of areas, with commonly recognised examples taking the form of playlist generators for video and music services, 
product recommenders for online stores, or content recommenders for social media platforms and open web content recommenders.

## 5. Compare and Contrast content based vs collaborative recommender systems.

-->The Content-based approach requires a good amount of information about items’ features, rather than using the user’s interactions and feedback. 
They can be movie attributes such as genre, year, director, actor etc. or textual content of articles that can be extracted by applying Natural Language Processing. 
Collaborative Filtering, on the other hand, doesn’t need anything else except the user’s historical preference on a set of items to recommend from, and 
because it is based on historical data, the core assumption made is that the users who have agreed in the past will also tend to agree in the future.  

-->Domain knowledge in the case of Collaborative Filtering is not necessary because the embeddings are automatically learned,
but in the case of a Content-based approach,since the feature representation of the items is hand-engineered to an extent, this technique requires a lot of 
domain knowledge to be fed with.

-->The collaborative filtering model can help users discover new interests and although the ML system might not know the user’s interest in a given item, 
the model might still recommend it because similar users are interested in that item. On the other hand, A Content-based model can only make recommendations 
based on the existing interests of the user and the model hence only has limited ability to expand on the users’ existing interests. 

## 6. Discuss any 3 similarity metrics.

-->Similarity Based Metrics:
1.)Pearson’s correlation -- Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, 
age and blood pressure. Pearson’s correlation coefficient is a measure related to the strength and direction of a linear relationship. We calculate this metric for the vectors x and y in the following way:

2.)Cosine similarity -- The cosine similarity calculates the cosine of the angle between two vectors.Accordingly, the cosine similarity can take on values between -1 and +1. 
If the vectors point in the exact same direction, the cosine similarity is +1. If the vectors point in opposite directions,the cosine similarity is -1.

3.)Jaccard similarity -- Cosine similarity is for comparing two real-valued vectors, but Jaccard similarity is for comparing two binary vectors (sets).We can see that the 
Jaccard similarity divides the size of the intersection by the size of the union of the sample sets.

## 7. What are sparse matrices and how do you create them in python?

In [None]:
-- Sparse matrices are commonly used in applied machine learning (such as in data containing data-encodings that map categories to count)
and even in whole subfields of machine learning such as natural language processing (NLP).

In [None]:
Python’s SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix. 
The sparse matrix representation outputs the row-column tuple where the matrix contains non-zero values along with those values.

In [None]:
import numpy as np
from scipy.sparse import csr_matrix

# create a 2-D representation of the matrix
A = np.array([[1, 0, 0, 0, 0, 0], [0, 0, 2, 0, 0, 1],\
 [0, 0, 0, 2, 0, 0]])
print("Dense matrix representation: \n", A)

# convert to sparse matrix representation 
S = csr_matrix(A)
print("Sparse matrix: \n",S)

# convert back to 2-D representation of the matrix
B = S.todense()
print("Dense matrix: \n", B)

## 8. Perform negative and positive text classification on nltk movie recommendation dataset, explain each steps performed.

In [4]:
# importing required libraries
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
#from google.colab import drive
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from gensim.models.word2vec import Word2Vec
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import scale
from sklearn.ensemble import GradientBoostingClassifier
import re as regex
from nltk.corpus import movie_reviews


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Checkout\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
!pip install --user -U nltk



In [6]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Checkout\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [7]:
movie_reviews.words()

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

In [8]:
all_words = nltk.FreqDist(movie_reviews.words())

In [9]:
feature_vector = list(all_words)[:4000]

In [10]:
movie_reviews.fileids('pos')

['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt',
 'pos/cv005_29443.txt',
 'pos/cv006_15448.txt',
 'pos/cv007_4968.txt',
 'pos/cv008_29435.txt',
 'pos/cv009_29592.txt',
 'pos/cv010_29198.txt',
 'pos/cv011_12166.txt',
 'pos/cv012_29576.txt',
 'pos/cv013_10159.txt',
 'pos/cv014_13924.txt',
 'pos/cv015_29439.txt',
 'pos/cv016_4659.txt',
 'pos/cv017_22464.txt',
 'pos/cv018_20137.txt',
 'pos/cv019_14482.txt',
 'pos/cv020_8825.txt',
 'pos/cv021_15838.txt',
 'pos/cv022_12864.txt',
 'pos/cv023_12672.txt',
 'pos/cv024_6778.txt',
 'pos/cv025_3108.txt',
 'pos/cv026_29325.txt',
 'pos/cv027_25219.txt',
 'pos/cv028_26746.txt',
 'pos/cv029_18643.txt',
 'pos/cv030_21593.txt',
 'pos/cv031_18452.txt',
 'pos/cv032_22550.txt',
 'pos/cv033_24444.txt',
 'pos/cv034_29647.txt',
 'pos/cv035_3954.txt',
 'pos/cv036_16831.txt',
 'pos/cv037_18510.txt',
 'pos/cv038_9749.txt',
 'pos/cv039_6170.txt',
 'pos/cv040_8276.txt',
 'pos/cv041_21113.txt',
 

In [11]:
feature = {}

# One movie review is chosen

review = movie_reviews.words('neg/cv954_19932.txt')
# ‘True’ is assigned if word in feature_vector can also be found in review. Otherwise ‘False’

for x in range(len(feature_vector)):
 feature[feature_vector[x]] = feature_vector[x] in review
# The words which are assigned ‘True’ are printed
[x for x in feature_vector if feature[x] == True]


[',',
 'the',
 '.',
 'a',
 'and',
 'of',
 'to',
 "'",
 'is',
 'in',
 's',
 '"',
 'it',
 'that',
 '-',
 ')',
 '(',
 'as',
 'with',
 'for',
 'this',
 'film',
 'i',
 'he',
 'but',
 'on',
 'are',
 't',
 'by',
 'be',
 'one',
 'movie',
 'an',
 'who',
 'not',
 'you',
 'from',
 'at',
 'was',
 'have',
 'they',
 'has',
 'all',
 'there',
 'like',
 'so',
 'about',
 'more',
 'what',
 'when',
 'their',
 ':',
 'just',
 'can',
 'if',
 'we',
 'into',
 'only',
 'no',
 'time',
 'story',
 'would',
 'been',
 'much',
 'get',
 'other',
 'do',
 'two',
 'characters',
 'first',
 'see',
 '!',
 'way',
 'because',
 'make',
 'life',
 'off',
 'too',
 'does',
 'had',
 'while',
 'people',
 'over',
 'could',
 'me',
 'scene',
 'bad',
 'my',
 'best',
 'these',
 'don',
 'new',
 'scenes',
 'know',
 'through',
 'great',
 'another',
 'made',
 'end',
 'work',
 'those',
 'down',
 'every',
 'though',
 'better',
 'audience',
 'seen',
 'going',
 'year',
 'performance',
 'same',
 'old',
 'your',
 'years',
 'comedy',
 'funny',
 'ac

In [12]:
# Document is a list of (words of review, category of review)
document = [(movie_reviews.words(file_id),category) for file_id in movie_reviews.fileids() for category in movie_reviews.categories(file_id)]
document

[(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg'),
 (['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', ...], 'neg'),
 (['it', 'is', 'movies', 'like', 'these', 'that', 'make', ...], 'neg'),
 (['"', 'quest', 'for', 'camelot', '"', 'is', 'warner', ...], 'neg'),
 (['synopsis', ':', 'a', 'mentally', 'unstable', 'man', ...], 'neg'),
 (['capsule', ':', 'in', '2176', 'on', 'the', 'planet', ...], 'neg'),
 (['so', 'ask', 'yourself', 'what', '"', '8mm', '"', '(', ...], 'neg'),
 (['that', "'", 's', 'exactly', 'how', 'long', 'the', ...], 'neg'),
 (['call', 'it', 'a', 'road', 'trip', 'for', 'the', ...], 'neg'),
 (['plot', ':', 'a', 'young', 'french', 'boy', 'sees', ...], 'neg'),
 (['best', 'remembered', 'for', 'his', 'understated', ...], 'neg'),
 (['janeane', 'garofalo', 'in', 'a', 'romantic', ...], 'neg'),
 (['and', 'now', 'the', 'high', '-', 'flying', 'hong', ...], 'neg'),
 (['a', 'movie', 'like', 'mortal', 'kombat', ':', ...], 'neg'),
 (['she', 'was', 'the', 'femme', 'in', 

In [13]:
def find_feature(word_list
    feature = {}
    for x in feature_vector:
         feature[x] = x in word_list
         return feature

find_feature(document[0][0])
feature_sets = [(find_feature(word_list),category) for (word_list,category) in document]

In [14]:
# The necessary packages and classifiers are imported
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
from sklearn import model_selection

In [15]:
# Splitting into training and testing sets
train_set,test_set = model_selection.train_test_split(feature_sets,test_size = 0.25)
model = SklearnClassifier(SVC(kernel = 'linear'))
model.train(train_set)
accuracy = nltk.classify.accuracy(model, test_set)
print('SVC Accuracy : {}'.format(accuracy))

SVC Accuracy : 0.48


## 9. Perform content based movie recommendation on the dataset given and explain each steps in detail.

In [16]:
# loading sms data
df = pd.read_csv('movies_metadata.csv', encoding='latin-1')
df

  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,Ø±Ú¯ Ø®ÙØ§Ø¨,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'ÙØ§Ø±Ø³Û'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [17]:
# Use your judgement to preprocess data

#Removing stop words
df['overview'] = df['overview'].str.lower()
stop_words = set(stopwords.words('english'))
df['pre_processed_overview'] = df['overview'].apply(lambda x: ' '.join([word for word in str(x).split() if word not in (stop_words)]))

#Removing punctuations
for remove in map(lambda r: regex.compile(regex.escape(r)), ["(",")","'",","]):df["pre_processed_overview"].replace(remove, "", inplace=True)

df['pre_processed_overview'].head(5)


0    led woody andys toys live happily room andys b...
1    siblings judy peter discover enchanted board g...
2    family wedding reignites ancient feud next-doo...
3    cheated on mistreated stepped on women holding...
4    george banks recovered daughters wedding recei...
Name: pre_processed_overview, dtype: object

In [None]:
# Construct TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform([x for x in df['pre_processed_overview']])
# Compute cosine similarity score between movies
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix)

In [None]:
# Take movie title as input and output 10 most similar movies
input_title = 'Grumpy Old Men'
idx = df.index[df.original_title == input_title][0]

similar_movies = list(enumerate(cosine_sim[idx]))
sorted_list = sorted(similar_movies, key=lambda x:x[1], reverse=True)
i=0
for m in sorted_list:
    print(f'{i}. {df[df.index == m[0]].original_title.values[0]}');
    i = i + 1
    if i>10:
        break