<a href="https://colab.research.google.com/github/yamac0/IE423/blob/main/task8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Initialize

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Joke text
dfJks = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/archive/JokeText.csv')

# User ratings for jokes
dfJksRtg1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/archive/UserRatings1.csv')

dfJksRtg2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/archive/UserRatings2.csv')


## Build Recommendations

Let us first build a recommender from the joke text itself, an approach known as **Content Based Filtering**.

### 1. Content Based Filtering

#### Prepare data

In [None]:
dfJks.head()

Unnamed: 0,JokeId,JokeText
0,0,"A man visits the doctor. The doctor says ""I ha..."
1,1,This couple had an excellent relationship goin...
2,2,Q. What's 200 feet long and has 4 teeth? \n\nA...
3,3,Q. What's the difference between a man and a t...
4,4,Q.\tWhat's O. J. Simpson's Internet address? \...


In [None]:
dfJks.shape

(100, 2)

#### Build Model

In [None]:
# Vectorize the joke text with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Generate a matrix of common terms that show up in each joke
tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(dfJks['JokeText'])

# Shape of the tfidf_matrix
print(tfidf_matrix.shape)

(100, 3774)


The similarity between any two jokes $x$ and $y$ is defined as the **Cosine Similarity**:

$$\text{cosine}(x, y) = \frac{x \cdot y^\top}{\|x\| \cdot \|y\|}$$

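Because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), every joke vector has unit length, so the dot product of two rows already equals their cosine similarity. Calling `cosine_similarity` normalizes again, which is harmless here and keeps the intent explicit.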
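As a quick sanity check of the dot-product claim, here is a minimal pure-Python sketch. The vectors `x` and `y` are made-up term weights, not values from the joke data, and the helper names are only illustrative.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity computed from the definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical raw term-weight vectors (assumed for illustration only)
x = [3.0, 0.0, 4.0]
y = [1.0, 2.0, 2.0]

cos_xy = cosine(x, y)                               # definition on raw vectors
dot_unit = dot(l2_normalize(x), l2_normalize(y))    # plain dot product on unit vectors

print(cos_xy, dot_unit)
```

Both numbers agree up to floating-point rounding, which is why a plain dot product is enough once the rows are unit length.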

In [None]:
# Calculate cosine similarity between each pair of jokes based on the terms they share

from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim.shape

(100, 100)

#### Predict

In [None]:
# Prepare recommendation function (build code from scratch and then package as function for ease of understanding)

def get_similar_jokes(joke_id, cosine_sim=cosine_sim):
    sim_scores = list(enumerate(cosine_sim[joke_id]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    joke_indices = [i[0] for i in sim_scores]
    return dfJks['JokeText'].iloc[joke_indices]


similar_jokes = get_similar_jokes(0)
print(similar_jokes)

86    A man, recently completing a routine physical ...
67    A man piloting a hot air balloon discovers he ...
87    A Czechoslovakian man felt his eyesight was gr...
75    There once was a man and a woman that both  go...
31    A man arrives at the gates of heaven. St. Pete...
38    What is the difference between men and women:\...
55    A man and Cindy Crawford get stranded on a des...
80    An Asian man goes into a New York CityBank to ...
32    What do you call an American in the finals of ...
3     Q. What's the difference between a man and a t...
Name: JokeText, dtype: object
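Ranking all scores and slicing off the first entry assumes the query joke always lands in first place. That can fail if two jokes have identical text or if floating-point rounding reorders a tie. A possible alternative, sketched below with a hypothetical helper `top_n_similar` and a made-up similarity row, is to exclude the query index explicitly.

```python
def top_n_similar(sim_row, query_idx, n=10):
    """Return indices of the n most similar items, excluding query_idx itself."""
    candidates = [(idx, score) for idx, score in enumerate(sim_row) if idx != query_idx]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in candidates[:n]]

# Hypothetical similarity row for item 2; item 0 is an exact duplicate of item 2,
# so both have similarity 1.0 and a rank-then-slice approach would drop item 0
# and keep the query itself.
row = [1.0, 0.2, 1.0, 0.5, 0.0]
result = top_n_similar(row, query_idx=2, n=3)
print(result)
```

The returned indices could then be passed to `dfJks['JokeText'].iloc[...]` in the same way as in `get_similar_jokes`.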


In [None]:
similar_jokes = get_similar_jokes(3, cosine_sim)

# Print the original joke and the top 10 similar jokes with a bit more context
print("Original Joke:\n", dfJks['JokeText'].iloc[3], "\n")
print("Top 10 Similar Jokes:")
for i, joke in enumerate(similar_jokes.head(10)):
    print(f"{i+1}. {joke[:100]}...")

Original Joke:
 Q. What's the difference between a man and a toilet? 

A. A toilet doesn't follow you around after you use it.
 

Top 10 Similar Jokes:
1. Q: What is the difference between George  Washington, Richard Nixon,
and Bill Clinton?

A: Washingto...
2. A man piloting a hot air balloon discovers he has wandered off course and
is hopelessly lost. He des...
3. What is the difference between men and women:


A woman wants one man to satisfy her every need.
A m...
4. Q: What's the difference between the government  and  the Mafia?

A: One of them is organized.
...
5. There once was a man and a woman that both  got in  a terrible car wreck. Both of their vehicles  
w...
6. Q: What's the difference between a Lawyer and a Plumber? 
A: A Plumber works to unclog the system.
...
7. A man visits the doctor. The doctor says "I have bad news for you.You have
cancer and Alzheimer's di...
8. What's the difference between a MacIntosh and an
Etch-A-Sketch? 

You don't have to shake the Mac to..

In [None]:
similar_jokes = get_similar_jokes(10, cosine_sim)

# Print the original joke and the top 10 similar jokes with a bit more context
print("Original Joke:\n", dfJks['JokeText'].iloc[10], "\n")
print("Top 10 Similar Jokes:")
for i, joke in enumerate(similar_jokes.head(10)):
    print(f"{i+1}. {joke[:100]}...")

Original Joke:
 Q. What do a hurricane, a tornado, and a redneck
divorce all have in common? 
A. Someone's going to lose their trailer...
 

Top 10 Similar Jokes:
1. Q: What do Monica Lewinsky and Bob Dole have in common?
A: They were both upset when Bill finished f...
2. "May I take your order?" the waiter asked. 

"Yes, how do you prepare your chickens?" 

"Nothing spe...
3. This guys wife asks, "Honey if I died would you remarry?" and he replies,
"Well, after a considerabl...
4. This couple had an excellent relationship going until one day he came home
from work to find his gir...
5. Two Rednecks were seated at the end of a bar when a young lady
seated a few stools up began to choke...
6. There once was a man and a woman that both  got in  a terrible car wreck. Both of their vehicles  
w...
7. A man piloting a hot air balloon discovers he has wandered off course and
is hopelessly lost. He des...
8. One Sunday morning William burst into the living room and said,
"Dad! Mom! I have som

These recommendations surface jokes whose wording is close to that of the query joke. Anyone querying the engine with a given joke receives the same recommendations, regardless of who they are. This is a good way to provide recommendations, especially when no further data is available.

What if we also have data on personal tastes? Can we capture those tastes and recommend jokes that are more personalized? For this, we use a technique called **Collaborative Filtering**, which is based on the idea that users similar to me can be used to predict how much I will like an item that those users have rated but I have not.
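To make the "users similar to me" idea concrete, here is a minimal user-based sketch in pure Python. It is not the method used in the rest of this notebook (which relies on SVD); the ratings, user names, and helper functions are all hypothetical, and the unweighted-mean-free formula is a deliberate simplification.

```python
import math

# Hypothetical ratings on a -10..10 scale: user -> {joke_id: rating}
ratings = {
    "me":    {0: 8.0, 1: -6.0, 2: 4.0},
    "alice": {0: 7.0, 1: -5.0, 2: 5.0, 3: 9.0},
    "bob":   {0: -8.0, 1: 6.0, 2: -3.0, 3: -7.0},
}

def cosine_on_shared(a, b):
    """Cosine similarity restricted to the jokes both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    num = sum(a[j] * b[j] for j in shared)
    den = math.sqrt(sum(a[j] ** 2 for j in shared)) * math.sqrt(sum(b[j] ** 2 for j in shared))
    return num / den if den else 0.0

def predict(user, joke_id):
    """Similarity-weighted average of ratings from other users who rated joke_id."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or joke_id not in other_ratings:
            continue
        sim = cosine_on_shared(ratings[user], other_ratings)
        num += sim * other_ratings[joke_id]
        den += abs(sim)
    return num / den if den else None

estimate = predict("me", 3)
print(estimate)
```

Here "alice" rates like "me" and liked joke 3, while "bob" rates in the opposite direction and disliked it, so both pieces of evidence push the estimate for "me" toward a high rating.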


### 2. Collaborative Filtering

#### Prepare data

In [None]:
dfJksRtg = pd.concat([dfJksRtg1, dfJksRtg2], ignore_index=True)
dfJksRtg.head(10)

Unnamed: 0,JokeId,User1,User2,User3,User4,User5,User6,User7,User8,User9,...,User73412,User73413,User73414,User73415,User73416,User73417,User73418,User73419,User73420,User73421
0,0,5.1,-8.79,-3.5,7.14,-8.79,9.22,-4.03,3.11,-3.64,...,,,,,,,,,,
1,1,4.9,-0.87,-2.91,-3.88,-0.58,9.37,-1.55,0.92,-3.35,...,,,,,,,,,,
2,2,1.75,1.99,-2.18,-3.06,-0.58,-3.93,-3.64,7.52,-6.46,...,,,,,,,,,,
3,3,-4.17,-4.61,-0.1,0.05,8.98,9.27,-6.99,0.49,-3.4,...,,,,,,,,,,
4,4,5.15,5.39,7.52,6.26,7.67,3.45,5.44,-0.58,1.26,...,,,,,,,,,,
5,5,1.75,-0.78,1.26,6.65,8.25,-8.11,-6.75,2.14,0.34,...,,,,,,,,,,
6,6,4.76,1.6,-5.39,-7.52,4.08,4.42,-0.15,-0.24,-3.01,...,,,,,,,,,,
7,7,3.3,1.07,1.5,7.28,2.52,2.72,-5.87,8.06,-6.65,...,,,,,,,,,,
8,8,-2.57,-8.69,-8.4,-5.15,-9.66,9.08,-3.54,2.82,-3.4,...,,,,,,,,,,
9,9,-1.41,-4.66,4.37,-7.14,2.48,9.13,-5.19,7.52,1.36,...,,,,,,,,,,


#### Build Model

In [None]:
# Prepare data into Surprise library format
# Install the Surprise library
!pip install scikit-surprise

from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split


# Reshape the DataFrame to have columns: user_id, item_id, rating
ratings_melted = dfJksRtg.melt(id_vars=['JokeId'], var_name='user_id', value_name='rating').dropna()
ratings_melted.head()


# Prepare data for the Surprise library
reader = Reader(rating_scale=(-10, 10))  # Adjust the rating scale as per your dataset
data = Dataset.load_from_df(ratings_melted[['user_id', 'JokeId', 'rating']], reader)

# Split the data into training and test sets
X_train, X_test = train_test_split(data, test_size=0.25)



In [None]:
# Define SVD model

from surprise import SVD

mdlSvdJksRtg = SVD()

In [None]:
# Fit SVD model

mdlSvdJksRtg.fit(X_train)
test_pred = mdlSvdJksRtg.test(X_test)

In [None]:
# Evaluate SVD accuracy

from surprise import accuracy

accuracy.rmse(test_pred)

RMSE: 4.4467


4.446708452994019
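For context, RMSE is the square root of the mean squared difference between true and estimated ratings, so it is expressed in the same units as the ratings (here a scale from -10 to 10). A minimal sketch of the computation, using made-up rating pairs rather than the model's actual predictions:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, estimated) pairs, assumed for illustration only
pairs = [(5.0, 2.0), (-4.0, 0.0), (1.0, 1.0), (-3.0, 2.0)]
score = rmse(pairs)
print(score)
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.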

In [None]:
from surprise.model_selection import GridSearchCV

# Define a smaller parameter grid
param_grid = {
    'n_epochs': [5, 10],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4]
}

# Perform grid search with fewer cross-validation folds
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=2)
gs.fit(data)

# Best RMSE score
print(f"Best RMSE score: {gs.best_score['rmse']}")

# Best parameters
print(f"Best parameters: {gs.best_params['rmse']}")


Best RMSE score: 4.426248092170406
Best parameters: {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


In [None]:
# Cross-validate

from surprise.model_selection import cross_validate

cross_validate(mdlSvdJksRtg, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4.4477  4.4822  4.4721  4.4839  4.4759  4.4724  0.0130  
MAE (testset)     3.4170  3.4444  3.4365  3.4458  3.4387  3.4365  0.0103  
Fit time          70.73   72.71   72.28   72.57   71.62   71.98   0.73    
Test time         15.44   15.66   15.56   11.20   17.27   15.02   2.03    


{'test_rmse': array([4.44771914, 4.48215244, 4.47211318, 4.48389921, 4.47587887]),
 'test_mae': array([3.41699025, 3.44440992, 3.43647587, 3.4457803 , 3.43872756]),
 'fit_time': (70.7347252368927,
  72.70677042007446,
  72.27509593963623,
  72.56669998168945,
  71.62123131752014),
 'test_time': (15.437731504440308,
  15.659624099731445,
  15.555923700332642,
  11.200460433959961,
  17.27031683921814)}

Let us now use the trained model to arrive at predictions.

#### Predict

Let's first see which jokes User1 has already rated.

In [None]:
user_1_ratings = ratings_melted[ratings_melted['user_id'] == 'User1']
print(user_1_ratings)


    JokeId user_id  rating
0        0   User1    5.10
1        1   User1    4.90
2        2   User1    1.75
3        3   User1   -4.17
4        4   User1    5.15
..     ...     ...     ...
95      95   User1    6.31
96      96   User1   -4.95
97      97   User1   -0.19
98      98   User1    3.25
99      99   User1    4.37

[100 rows x 3 columns]


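Now let's ask the model for a prediction. Raw IDs passed to `predict` must match the training data, where users are strings such as `'User1'` and jokes are the integers 0 to 99. The call below passes user `1` and joke `302`, neither of which appears in the training data (and User1 has already rated all 100 jokes), so the estimate is simply the global mean rating.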

In [None]:
mdlSvdJksRtg.predict(1, 302)

Prediction(uid=1, iid=302, r_ui=None, est=0.7432186814010393, details={'was_impossible': False})

One notable feature of this recommender is that it does not care what the joke is (or what it contains). It works purely from the assigned joke ID and predicts ratings based on how other users have rated that joke.
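The prediction rule behind Surprise's `SVD` is, per its documentation, the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, with unknown users or items contributing nothing. The sketch below mimics that rule in pure Python. The function `svd_estimate` and every parameter value are hypothetical stand-ins, not values learned by the model above, and clipping to the rating scale is omitted.

```python
def svd_estimate(uid, iid, global_mean, user_bias, item_bias, user_factors, item_factors):
    """Estimate a rating as mean + biases + factor dot product.

    Unknown users or items simply contribute nothing, so a fully
    unknown (uid, iid) pair falls back to the global mean.
    """
    est = global_mean
    if uid in user_bias:
        est += user_bias[uid]
    if iid in item_bias:
        est += item_bias[iid]
    if uid in user_factors and iid in item_factors:
        est += sum(p * q for p, q in zip(user_factors[uid], item_factors[iid]))
    return est

# Hypothetical learned parameters (assumed for illustration only)
params = dict(
    global_mean=0.74,
    user_bias={"User1": 1.0},
    item_bias={3: -0.5},
    user_factors={"User1": [0.5, -1.0]},
    item_factors={3: [2.0, 0.5]},
)

known = svd_estimate("User1", 3, **params)   # both IDs known to the model
unknown = svd_estimate(1, 302, **params)     # neither ID known: global mean only
print(known, unknown)
```

Nothing in this rule looks at the joke text: only IDs and the parameters learned from ratings are involved.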

An extension to this could be a hybrid model that uses content-based filtering in the initial phase, when user preferences are not yet available, and then gradually shifts to collaborative filtering blended with some content-based filtering.
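One simple way such a hybrid might work is a weighted blend whose weight grows with the number of ratings a user has supplied. The sketch below is only an illustration: `hybrid_score`, `ramp`, and `max_collab_weight` are hypothetical names and values, and it assumes both scores are already on the same scale.

```python
def hybrid_score(content_score, collab_score, n_user_ratings, ramp=20, max_collab_weight=0.8):
    """Blend two scores, trusting collaborative filtering more as ratings accumulate.

    The collaborative weight grows linearly with the user's rating count and is
    capped so that some content-based signal always remains in the blend.
    """
    weight = min(n_user_ratings / ramp, max_collab_weight)
    return (1.0 - weight) * content_score + weight * collab_score

# Hypothetical scores for one joke, assumed to be on the same scale
cold_start = hybrid_score(2.0, 6.0, n_user_ratings=0)     # content only
warming_up = hybrid_score(2.0, 6.0, n_user_ratings=10)    # even blend
established = hybrid_score(2.0, 6.0, n_user_ratings=100)  # capped collaborative weight
print(cold_start, warming_up, established)
```

A linear ramp is just one choice; the shape of the schedule and the cap would need tuning against held-out ratings.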

## Takeaways

* Introduced content-based filtering to recommend items based on their descriptions using *TF-IDF Vectorization*
* Applied collaborative filtering, for cases where user preference data is available, to recommend items based on the ratings of similar users using *Singular Value Decomposition*