<a href="https://colab.research.google.com/github/servetgulnaroglu/Forecasting/blob/main/notebooks/n8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <font color='#475468'> Joke Recommendations:</font>

## Initialize

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
# Joke metadata
dfJokeTxt = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/JokeText.csv')

# User ratings for each joke
dfUserRatings1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/UserRatings1.csv')
dfUserRatings2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/UserRatings2.csv')

In [92]:
# dfUserRatings = pd.concat([dfUserRatings1, dfUserRatings2], axis=1)
dfUserRatings = pd.concat([dfUserRatings1.set_index('JokeId'), dfUserRatings2.set_index('JokeId')], axis=1).reset_index()


In [93]:
dfUserRatings.shape

(100, 73422)

### 1. Content Based Filtering

#### Prepare data

In [15]:
dfJokeTxt.head()

Unnamed: 0,JokeId,JokeText
0,0,"A man visits the doctor. The doctor says ""I ha..."
1,1,This couple had an excellent relationship goin...
2,2,Q. What's 200 feet long and has 4 teeth? \n\nA...
3,3,Q. What's the difference between a man and a t...
4,4,Q.\tWhat's O. J. Simpson's Internet address? \...


In [16]:
dfJokeTxt.shape

(100, 2)

In [17]:
# Remove duplicates
dfJokeTxt.drop_duplicates(subset ='JokeText', keep = 'first', inplace = True)
dfJokeTxt.shape

(100, 2)

In [94]:
dfUserRatings.head()

Unnamed: 0,JokeId,User1,User2,User3,User4,User5,User6,User7,User8,User9,...,User73412,User73413,User73414,User73415,User73416,User73417,User73418,User73419,User73420,User73421
0,0,5.1,-8.79,-3.5,7.14,-8.79,9.22,-4.03,3.11,-3.64,...,,,,,,,,,,
1,1,4.9,-0.87,-2.91,-3.88,-0.58,9.37,-1.55,0.92,-3.35,...,,,,,,,,,,
2,2,1.75,1.99,-2.18,-3.06,-0.58,-3.93,-3.64,7.52,-6.46,...,,,,,,,,,,
3,3,-4.17,-4.61,-0.1,0.05,8.98,9.27,-6.99,0.49,-3.4,...,,,,,,,,,,
4,4,5.15,5.39,7.52,6.26,7.67,3.45,5.44,-0.58,1.26,...,3.64,4.32,6.99,-9.66,-8.4,-0.63,9.51,-7.67,-1.6,8.3


We have lots of NaN (Not a Number) values. I will fill each one with the average rating of its row, i.e. the mean rating of that joke.

In [95]:
dfUserRatings = dfUserRatings.apply(lambda row: row.fillna(row.mean()), axis=1)

In [96]:
dfUserRatings.head()

Unnamed: 0,JokeId,User1,User2,User3,User4,User5,User6,User7,User8,User9,...,User73412,User73413,User73414,User73415,User73416,User73417,User73418,User73419,User73420,User73421
0,0.0,5.1,-8.79,-3.5,7.14,-8.79,9.22,-4.03,3.11,-3.64,...,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968
1,1.0,4.9,-0.87,-2.91,-3.88,-0.58,9.37,-1.55,0.92,-3.35,...,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013
2,2.0,1.75,1.99,-2.18,-3.06,-0.58,-3.93,-3.64,7.52,-6.46,...,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467
3,3.0,-4.17,-4.61,-0.1,0.05,8.98,9.27,-6.99,0.49,-3.4,...,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454
4,4.0,5.15,5.39,7.52,6.26,7.67,3.45,5.44,-0.58,1.26,...,3.64,4.32,6.99,-9.66,-8.4,-0.63,9.51,-7.67,-1.6,8.3


In [97]:
missing_values = dfUserRatings.isnull().sum().sum()
missing_values

0
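The row-mean imputation above can be sketched on a toy table. The notebook's `row.mean()` runs over the whole row, so `JokeId` appears to take part in the mean (and turns into a float, as the `0.0`, `1.0` values in the output suggest). The variant below is only an illustration under that assumption: it restricts the mean to the rating columns. The column names and values are made up.

```python
import numpy as np
import pandas as pd

# Toy wide ratings table: one row per joke, one column per user (values are made up).
toy = pd.DataFrame({
    "JokeId": [0, 1],
    "User1": [4.0, np.nan],
    "User2": [np.nan, -2.0],
    "User3": [2.0, 6.0],
})

# Restrict the mean to the rating columns so JokeId does not leak into it.
rating_cols = [c for c in toy.columns if c.startswith("User")]
row_means = toy[rating_cols].mean(axis=1)  # NaNs are skipped by default

filled = toy.copy()
# fillna aligns on the row index, so each NaN takes the mean of its own row.
filled[rating_cols] = toy[rating_cols].apply(lambda col: col.fillna(row_means))

print(filled)
```

Working column by column also avoids a Python-level call per row, which may matter with tens of thousands of user columns.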

#### Build Model

In [98]:
jokeText = 'JokeText'

In [99]:
# Generate a matrix of common terms that show up in each joke

from sklearn.feature_extraction.text import TfidfVectorizer
mdlTfvMvs = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
tfidf_matrix = mdlTfvMvs.fit_transform(dfJokeTxt[jokeText])
tfidf_matrix.shape

(100, 3774)

In [100]:
# Calculate cosine similarity between each pair of jokes as a function of the similarity of the common terms

from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim.shape

(100, 100)
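A pairwise similarity matrix like `cosine_sim` can also drive item-to-item recommendations ("jokes similar to this joke"). The sketch below shows one way to do that on a hypothetical three-document corpus; `most_similar` is an illustrative helper, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for dfJokeTxt['JokeText'].
docs = [
    "the doctor told the patient some bad news",
    "a patient calls the doctor about the news",
    "two golfers argue on the golf course",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(tfidf)  # square matrix, one row and column per document

def most_similar(idx, top_n=1):
    """Return indices of the top_n documents most similar to docs[idx]."""
    order = np.argsort(sim[idx])[::-1]  # highest similarity first
    order = order[order != idx]         # drop the document itself
    return order[:top_n].tolist()

print(most_similar(0))
```

Documents 0 and 1 share the terms "doctor", "patient" and "news", so they score as each other's nearest neighbour, while the golf text shares no terms with document 0.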

#### Predict

In [101]:
# Prepare recommendation function (build code from scratch and then package as function for ease of understanding)

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(dfJokeTxt['JokeText'])

def get_recommendations(keyword, top_n=10):
    keyword_vector = vectorizer.transform([keyword])
    similarities = cosine_similarity(keyword_vector, tfidf_matrix)
    sorted_indices = similarities.argsort()[0][::-1]
    if similarities[0, sorted_indices[0]] == 0:
        return ["No jokes found with the keyword."]
    return dfJokeTxt['JokeText'].iloc[sorted_indices[:top_n]]

In [102]:
def print_recommendations(keyword, top_n=10):
    recommendations = get_recommendations(keyword, top_n)
    for joke in recommendations:
        print(joke)
        print("\n")

In [103]:
print_recommendations("wife", 2)

A couple has been married for 75 years. For the husband's 95th
birthday, his wife decides to surprise him by hiring a prostitute.
That day, the doorbell rings. The husband uses his walker to get to
the door and opens it. 
A 21-year-old in a latex outfit smiles and
says, "Hi, I here to give you super sex!" 
The old man says, "I'll take the soup."



A guy stood over his tee shot for what seemed an eternity, looking up, looking down, measuring the distance,
figuring the wind direction and speed. Driving his partner nuts.

Finally his exasperated partner says, "What the hell is taking so long? Hit the goddamn ball!"
The guy answers, "My wife is up there watching me from the clubhouse. I want to make this a perfect shot."
"Well, hell, man, you don't stand a snowball's chance in hell of hitting her from here!" 





In [104]:
print_recommendations("gta 5", 2)

No jokes found with the keyword.
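A likely reason for the empty result: none of the query's tokens exist in the fitted vocabulary, so the query vector is all zeros and every cosine similarity is 0. The sketch below reproduces that behaviour on a hypothetical two-document corpus; whether "gta" is truly absent from the joke texts is inferred from the output above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical two-document corpus; the notebook fits on dfJokeTxt['JokeText'].
docs = ["my wife plays golf", "the doctor has bad news"]
vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

query = vec.transform(["gta 5"])
# "gta" is not in the fitted vocabulary, and the default token pattern
# keeps only tokens of two or more characters, so "5" is dropped as well.
n_nonzero = query.nnz                                   # non-zero entries in the query vector
best_score = cosine_similarity(query, matrix).max()     # best match over the corpus

print(n_nonzero, best_score)
```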




In [105]:
print_recommendations("doctor",3)

A man visits the doctor. The doctor says "I have bad news for you.You have
cancer and Alzheimer's disease". 
The man replies "Well,thank God I don't have cancer!"



A man, recently completing a routine physical examination receives a
phone call from his doctor.  The doctor says, "I have some good news and
some bad news."  The man says, "OK, give me the good news first."  The
doctor says, "The good news is, you have 24 hours to live."  The man
replies, "Shit!  That's the good news?  Then what's the bad news?"

The doctor says, "The bad news is, I forgot to call you yesterday."



A Czechoslovakian man felt his eyesight was growing steadily worse, and 
felt it was time to go see an optometrist. 

The doctor started with some simple testing, and showed him a standard eye 
chart with letters of
diminishing size: CRKBNWXSKZY. . . 

"Can you read this?" the doctor asked. 

"Read it?" the Czech answered. "Doc, I know him!"






### 2. Collaborative Filtering

In [106]:
dfJokeTxt.head(10)

Unnamed: 0,JokeId,JokeText
0,0,"A man visits the doctor. The doctor says ""I ha..."
1,1,This couple had an excellent relationship goin...
2,2,Q. What's 200 feet long and has 4 teeth? \n\nA...
3,3,Q. What's the difference between a man and a t...
4,4,Q.\tWhat's O. J. Simpson's Internet address? \...
5,5,Bill & Hillary are on a trip back to Arkansas....
6,6,How many feminists does it take to screw in a ...
7,7,Q. Did you hear about the dyslexic devil worsh...
8,8,A country guy goes into a city bar that has a ...
9,9,"Two cannibals are eating a clown, one turns to..."


#### Process Data

In [107]:
dfUserRatings

Unnamed: 0,JokeId,User1,User2,User3,User4,User5,User6,User7,User8,User9,...,User73412,User73413,User73414,User73415,User73416,User73417,User73418,User73419,User73420,User73421
0,0.0,5.10,-8.79,-3.50,7.14,-8.79,9.22,-4.03,3.11,-3.64,...,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968,0.901968
1,1.0,4.90,-0.87,-2.91,-3.88,-0.58,9.37,-1.55,0.92,-3.35,...,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013,0.163013
2,2.0,1.75,1.99,-2.18,-3.06,-0.58,-3.93,-3.64,7.52,-6.46,...,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467,0.193467
3,3.0,-4.17,-4.61,-0.10,0.05,8.98,9.27,-6.99,0.49,-3.40,...,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454,-1.412454
4,4.0,5.15,5.39,7.52,6.26,7.67,3.45,5.44,-0.58,1.26,...,3.640000,4.320000,6.990000,-9.660000,-8.400000,-0.630000,9.510000,-7.670000,-1.600000,8.300000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,95.0,6.31,-1.02,3.98,3.93,9.13,1.94,0.44,1.21,6.94,...,2.280000,1.377251,1.377251,1.377251,1.377251,1.377251,1.377251,1.377251,1.377251,1.377251
96,96.0,-4.95,-0.97,-6.46,-2.57,9.17,1.99,-0.78,5.34,5.83,...,1.493296,1.493296,1.493296,1.493296,1.493296,0.780000,1.493296,1.493296,1.493296,1.493296
97,97.0,-0.19,4.13,-6.89,1.07,9.17,3.45,-1.02,1.94,5.53,...,0.874097,0.874097,0.874097,0.874097,0.874097,0.874097,0.874097,0.874097,0.874097,0.874097
98,98.0,3.25,-1.84,-2.33,2.33,9.08,9.17,1.70,3.06,6.55,...,7.330000,-0.031961,-0.031961,-0.031961,-0.031961,-0.031961,-0.031961,-0.031961,-0.031961,-0.031961


In [108]:
duplicate_columns = dfUserRatings.columns[dfUserRatings.columns.duplicated()]
print(f"Duplicate columns: {duplicate_columns}")

Duplicate columns: Index([], dtype='object')


In [109]:
# # Rename or remove duplicate columns
# if not duplicate_columns.empty:
#     # Example approach: append suffix to duplicate columns
#     new_columns = []
#     for col in dfUserRatings.columns:
#         if col in new_columns:
#             new_columns.append(col + '_dup')
#         else:
#             new_columns.append(col)
#     dfUserRatings.columns = new_columns

# # Ensure JokeId column exists
# if 'JokeId' not in dfUserRatings.columns:
#     raise ValueError("The 'JokeId' column is not found in the dataframe.")

# Melt the dataframe into long format
dfRatings = pd.melt(dfUserRatings, id_vars=['JokeId'], var_name='UserId', value_name='Rating')

# Convert UserId to integer by stripping the 'User' prefix (raw string avoids an invalid-escape warning)
dfRatings['UserId'] = dfRatings['UserId'].str.extract(r'(\d+)', expand=False).astype(int)
# Now dfRatings has the columns JokeId, UserId, Rating
print(dfRatings.head())

   JokeId  UserId  Rating
0     0.0       1    5.10
1     1.0       1    4.90
2     2.0       1    1.75
3     3.0       1   -4.17
4     4.0       1    5.15


In [110]:
dfRatings['JokeId'] = dfRatings['JokeId'].astype(int)

In [111]:
dfRatings.head()

Unnamed: 0,JokeId,UserId,Rating
0,0,1,5.1
1,1,1,4.9
2,2,1,1.75
3,3,1,-4.17
4,4,1,5.15
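The wide-to-long reshaping can be checked on a small table first. This sketch uses a two-joke, two-user frame with values copied from the `head()` output earlier; `wide` and `long_df` are illustrative names.

```python
import pandas as pd

# Small wide table with the same layout as dfUserRatings.
wide = pd.DataFrame({
    "JokeId": [0, 1],
    "User1": [5.1, 4.9],
    "User2": [-8.79, -0.87],
})

# One row per (joke, user) pair.
long_df = pd.melt(wide, id_vars=["JokeId"], var_name="UserId", value_name="Rating")

# Strip the 'User' prefix; expand=False returns a Series rather than a DataFrame.
long_df["UserId"] = long_df["UserId"].str.extract(r"(\d+)", expand=False).astype(int)

print(long_df)
```

The result has `n_jokes * n_users` rows, which matches the 100 x 73,423 = 7,342,300 rows reported by `dfRatings.shape`.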


In [51]:
dfRatings.shape

(7342300, 3)

#### Build Model

In [112]:
# Prepare data into Surprise library format

!pip3 install scikit-surprise #or !conda install -c conda-forge scikit-surprise
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split

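# NOTE: the ratings shown above range from about -10 to 10, so a scale such as
# (-10, 10) would describe the data; the outputs below were produced with (0, 5).
reader = Reader(rating_scale=(0,5))
# Surprise reads the columns as (user, item, rating); here JokeId sits in the user slot.
X = Dataset.load_from_df(dfRatings[['JokeId', 'UserId', 'Rating']], reader)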
X_train, X_test = train_test_split(X, test_size=.25)

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357283 sha256=98e762ff19f7a0319c085db2c0c66ac7392205a93ba5dc785f6abae2911a6214
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Succe
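The `Reader(rating_scale=(0,5))` call declares a 0 to 5 scale, while the ratings shown earlier span roughly -10 to 10. Surprise clips its estimates to the declared scale by default, so a too-narrow scale can inflate the error. The sketch below illustrates the effect with synthetic numbers only; it does not use the joke data or the Surprise library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ratings on a -10..10 scale (illustrative only, not the joke data).
true = rng.uniform(-10, 10, size=10_000)
# Pretend model output: the true rating plus some noise.
raw_pred = true + rng.normal(0, 1, size=true.size)

def rmse(y, y_hat):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

rmse_narrow = rmse(true, np.clip(raw_pred, 0, 5))     # estimates forced into (0, 5)
rmse_full = rmse(true, np.clip(raw_pred, -10, 10))    # scale matching the data

print(rmse_narrow, rmse_full)
```

With the narrow scale, every rating below 0 or above 5 is mispredicted by construction, which is one plausible contributor to the RMSE values reported below.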

In [115]:
# Define SVD model

from surprise import SVD

mdlSvdRtg = SVD()

In [117]:
# Fit SVD model

mdlSvdRtg.fit(X_train)
test_pred = mdlSvdRtg.test(X_test)

In [118]:
# Evaluate SVD accuracy

from surprise import accuracy

accuracy.rmse(test_pred)

RMSE: 3.6867


3.6866882576475257

In [126]:
# Tune hyperparameters

from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [15], 'lr_all': [0.002],
              'reg_all': [0.4]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=2)

gs.fit(X)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

3.656364420125345
{'n_epochs': 15, 'lr_all': 0.002, 'reg_all': 0.4}
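Each list in `param_grid` holds a single value, so the search above evaluates exactly one candidate. The sketch below shows how a grid expands into candidates; `expand_grid` is a hypothetical helper for illustration, not Surprise's implementation, and the wider grid values are arbitrary.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    names = list(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        yield dict(zip(names, values))

# The grid used above: one value per parameter, hence a single candidate.
single = {"n_epochs": [15], "lr_all": [0.002], "reg_all": [0.4]}
# Hypothetical wider grid (values chosen only for illustration).
wider = {"n_epochs": [5, 15], "lr_all": [0.002, 0.005], "reg_all": [0.1, 0.4]}

print(len(list(expand_grid(single))))  # 1
print(len(list(expand_grid(wider))))   # 8
```

Given the fit times of around two minutes per fold reported below, a wider grid on the full 7.3 million ratings would take correspondingly longer.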


In [128]:
# Cross-validate

from surprise.model_selection import cross_validate

cross_validate(mdlSvdRtg, X, measures=['RMSE', 'MAE'], cv=2, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
RMSE (testset)    3.6354  3.6324  3.6339  0.0015  
MAE (testset)     2.5321  2.5267  2.5294  0.0027  
Fit time          120.62  126.72  123.67  3.05    
Test time         46.25   46.99   46.62   0.37    


{'test_rmse': array([3.63544961, 3.63244644]),
 'test_mae': array([2.53211614, 2.52668242]),
 'fit_time': (120.62142276763916, 126.71864628791809),
 'test_time': (46.25222826004028, 46.99187254905701)}
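The two reported metrics differ in how they weigh errors. The sketch below computes both by hand on made-up numbers to show why RMSE comes out at least as large as MAE.

```python
import numpy as np

# Made-up true ratings and predictions, purely to illustrate the two metrics.
y_true = np.array([5.1, -8.79, -3.5, 7.14])
y_pred = np.array([1.0, 0.0, 0.0, 5.0])

err = y_true - y_pred
mae = float(np.mean(np.abs(err)))         # average absolute error
rmse = float(np.sqrt(np.mean(err ** 2)))  # squares errors, so large misses weigh more

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}")
```

A gap like the one above (RMSE about 3.63 versus MAE about 2.53) suggests that a share of the predictions miss by a wide margin.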

Let us now use the trained model to arrive at predictions.

#### Predict

Let's first look at the ratings that joke #10 has received.

In [131]:
dfRatings[dfRatings['JokeId'] == 10]

Unnamed: 0,JokeId,UserId,Rating
10,10,1,4.220000
110,10,2,7.720000
210,10,3,3.450000
310,10,4,-3.060000
410,10,5,6.890000
...,...,...,...
7341610,10,73417,1.728859
7341710,10,73418,1.728859
7341810,10,73419,1.728859
7341910,10,73420,1.728859


The value 1.728859 is repeated many times: it is the row average we filled in earlier, which marks users who never rated joke 10. We will pick some of those users, starting with UserId 73417, to see how the model expects them to rate the joke.

In [132]:
mdlSvdRtg.predict(10, 73417)

Prediction(uid=10, iid=73417, r_ui=None, est=1.9472399243280263, details={'was_impossible': False})

In [133]:
mdlSvdRtg.predict(10, 73418)

Prediction(uid=10, iid=73418, r_ui=None, est=1.5592892296308407, details={'was_impossible': False})

In [134]:
mdlSvdRtg.predict(10, 73419)

Prediction(uid=10, iid=73419, r_ui=None, est=1.0976352838648578, details={'was_impossible': False})

In [135]:
mdlSvdRtg.predict(10, 73420)

Prediction(uid=10, iid=73420, r_ui=None, est=1.5730092940282059, details={'was_impossible': False})

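According to the results above, the model estimates that users 73417, 73418, 73419 and 73420 would rate joke 10 at roughly 1.95, 1.56, 1.10 and 1.57 respectively (the `est` field). Note that `uid=10` is the joke here, because `JokeId` was passed as the first column when the data was loaded.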
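For context, a biased matrix-factorisation model such as Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, then clips the result to the declared rating scale. The sketch below is a minimal illustration of that rule; `svd_estimate` and all numbers in it are invented, whereas a fitted model learns these values from the training set.

```python
import numpy as np

def svd_estimate(mu, b_u, b_i, p_u, q_i, scale=(0, 5)):
    """Biased matrix-factorisation estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    low, high = scale
    return min(high, max(low, raw))

# All numbers below are invented for illustration.
mu = 1.0                     # global mean rating
b_u, b_i = 0.3, -0.1         # user and item biases
p_u = np.array([0.5, -0.2])  # user factors
q_i = np.array([0.4, 0.1])   # item factors

est_inside = svd_estimate(mu, b_u, b_i, p_u, q_i)
est_clipped = svd_estimate(9.0, b_u, b_i, p_u, q_i)  # raw value above 5 gets clipped

print(est_inside, est_clipped)
```

Because the notebook passes `JokeId` in the user slot, the "user" terms in this rule correspond to jokes and the "item" terms to users in the fitted model.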

## Takeaways

* Learned content-based filtering to recommend items based on their descriptions using *TF-IDF Vectorization*
* When user preference data is available, collaborative filtering can recommend items based on the ratings of similar users, here using *Singular Value Decomposition*