## Pre-Processing

#### Data:
`interactions.csv`\
`recipes.csv`\
`users.csv`

#### Introduction:
*The goal of the preprocessing work is to prepare your data for fitting models. If you identified categorical features in your dataset during the EDA step, now is the time to create dummy features so that those
features can be included in model development. Standardizing the numeric magnitude of your features and creating train and test splits also happen in this step. You may want to save a version of your clean, preprocessed data frame as a CSV to access later.*

#### General Steps:
- Create dummy or indicator features for categorical variables: `get_dummies()`
- Standardize the magnitude of numeric features using a scaler: `StandardScaler()`
- Split into testing and training datasets: `train_test_split()`
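
The general steps above can be sketched on a small example. This is a minimal illustration, not the notebook's own pipeline: the `toy` frame, its `course` column, and its values are hypothetical. It also fits the scaler on the training split only, which is one common choice rather than a requirement of the steps listed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; column names and values are for illustration only
toy = pd.DataFrame({
    "minutes": [55.0, 30.0, 130.0, 45.0, 20.0, 75.0],
    "n_steps": [11.0, 9.0, 6.0, 11.0, 5.0, 8.0],
    "course": ["main", "breakfast", "main", "side", "side", "main"],
    "rating": [5.0, 4.0, 4.5, 3.0, 5.0, 4.0],
})

# 1. Dummy/indicator features for the categorical column
X = pd.get_dummies(toy.drop(columns="rating"), columns=["course"])
y = toy["rating"]

# 2. Train/test split (done before scaling so the scaler never sees test rows)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

# 3. Standardize the numeric columns using statistics from the training rows
numeric_cols = ["minutes", "n_steps"]
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```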

#### For collaborative filtering:
- Create one dataset for users and recipes

In [44]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()
import sklearn

from sklearn.model_selection import train_test_split

# Show plots inline
%matplotlib inline

In [45]:
interactions = pd.read_csv('interactions.csv',index_col=[0])
recipes = pd.read_csv('recipes.csv',index_col=[0])
users = pd.read_csv('users.csv',index_col=[0])

In [46]:
recipes.head(1)

Unnamed: 0,recipe,recipe_id,minutes,user_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,rating,count
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7,5.0,3


In [47]:
interactions.head(2)

Unnamed: 0,user_id,recipe_id,date,rating,review,word_count,review_clean
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for...,27,great with a salad. cooked on top of stove for...
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall...",31,"so simple, so delicious! great for chilly fall..."


In [48]:
interactions = interactions.drop(columns=['review'])
interactions.head(1)

Unnamed: 0,user_id,recipe_id,date,rating,word_count,review_clean
0,38094,40893,2003-02-17,4,27,great with a salad. cooked on top of stove for...


In [49]:
users.head(2)

Unnamed: 0,user_id,count,initial_date,final_date,days_active,frequency,rating
0,1533,128,2002-02-19,2008-03-01,2202 days,0.058129,4.710938
1,1535,794,2004-05-22,2018-03-03,5033 days,0.157759,4.473552


In [50]:
users.describe()

Unnamed: 0,user_id,count,frequency,rating
count,226419.0,226419.0,54856.0,226419.0
mean,594002000.0,5.000455,0.093665,3.87246
std,901389800.0,49.679536,0.350759,1.77354
min,1533.0,1.0,0.000365,0.0
25%,553027.5,1.0,0.004784,3.8
50%,1578078.0,1.0,0.01145,5.0
75%,1803500000.0,2.0,0.036145,5.0
max,2002373000.0,7671.0,12.0,5.0


In [51]:
print(users.shape)
print(interactions.shape)
print(recipes.shape)

(226419, 7)
(1132198, 6)
(230921, 14)


## Categorical Variables

We don't need to encode any categorical variables because we use the TF-IDF vectorizer.

In [40]:
#dfo = df.select_dtypes(include=['object']) # select object type columns
#df = pd.concat([df.drop(dfo, axis=1), pd.get_dummies(dfo)], axis=1)

In [52]:
# Collect every whitespace-separated token from the cleaned reviews into one flat list
review_comments = []
for review in interactions['review_clean']:
    review_comments.extend(review.split())

In [53]:
type(review_comments)

list

In [54]:
# Collect every whitespace-separated token from the recipe titles into one flat list
recipe_titles = []
for title in recipes['recipe']:
    recipe_titles.extend(title.split())

In [55]:
type(recipe_titles)

list
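
The same flat token list can also be built without an explicit loop. The sketch below uses a small hypothetical `titles` Series (including one missing value) in place of the real columns, and adds a `Counter` to show one thing the list could be used for; neither is part of the original notebook.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for recipes['recipe']; None mimics a missing title
titles = pd.Series(
    ["arriba baked winter squash", "breakfast pizza", None, "winter chili"]
)

# Drop missing values, split each title on whitespace, flatten to one list
tokens = titles.dropna().str.split().explode().tolist()

# Raw token counts, e.g. to eyeball the most common words
counts = Counter(tokens)
print(counts.most_common(3))
```

Dropping missing values first avoids the error that `.split()` would raise on a non-string entry.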

### TFIDF on Recipe Titles

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [56]:
# Replace all non-letter characters with whitespace
# (regex=True is required in recent pandas, where the default is a literal match)
recipes['recipe_clean'] = recipes['recipe'].str.replace(r'[^a-zA-Z]', ' ', regex=True)

# Change to lower case
recipes['recipe_clean'] = recipes['recipe_clean'].str.lower()

# Print the first 5 rows of the recipe_clean column
print(recipes['recipe_clean'].head())

0    arriba   baked winter squash mexican style
1              a bit different  breakfast pizza
2                     all in the kitchen  chili
3                            alouette  potatoes
4            amish  tomato ketchup  for canning
Name: recipe_clean, dtype: object
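
The output above still contains runs of spaces where punctuation was replaced. That is harmless for `TfidfVectorizer`'s default tokenizer, but if tidier strings are wanted, a second pass can collapse them. This is a minimal sketch on hypothetical titles, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical raw titles with punctuation and uneven spacing
raw = pd.Series(["Arriba!  Baked Winter-Squash", "a bit different: breakfast pizza"])

clean = (
    raw.str.replace(r"[^a-zA-Z]", " ", regex=True)  # non-letters -> space
       .str.replace(r"\s+", " ", regex=True)        # collapse repeated whitespace
       .str.strip()                                 # trim leading/trailing spaces
       .str.lower()
)
print(clean.tolist())
```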


In [58]:
# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=400, stop_words='english')

# Fit the vectorizer and transform the data in one call
# (equivalent to calling .fit() and then .transform())
tv_transformed = tv.fit_transform(recipes['recipe_clean'])

# Create a DataFrame with these features
tv_recipe_names = pd.DataFrame(tv_transformed.toarray(), 
                     columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(tv_recipe_names.head(2))

   TFIDF_alfredo  TFIDF_almond  TFIDF_angel  TFIDF_apple  TFIDF_apples  \
0            0.0           0.0          0.0          0.0           0.0   
1            0.0           0.0          0.0          0.0           0.0   

   TFIDF_applesauce  TFIDF_apricot  TFIDF_artichoke  TFIDF_asian  \
0               0.0            0.0              0.0          0.0   
1               0.0            0.0              0.0          0.0   

   TFIDF_asparagus  ...  TFIDF_white  TFIDF_wild  TFIDF_wine  TFIDF_wings  \
0              0.0  ...          0.0         0.0         0.0          0.0   
1              0.0  ...          0.0         0.0         0.0          0.0   

   TFIDF_wraps  TFIDF_ww  TFIDF_yellow  TFIDF_yogurt  TFIDF_yummy  \
0          0.0       0.0           0.0           0.0          0.0   
1          0.0       0.0           0.0           0.0          0.0   

   TFIDF_zucchini  
0             0.0  
1             0.0  

[2 rows x 400 columns]


In [60]:
title_word_frequency = tv_recipe_names.sum().sort_values(ascending=False)

### TFIDF on Reviews

In [61]:
#interactions['review']
interactions.head()

Unnamed: 0,user_id,recipe_id,date,rating,word_count,review_clean
0,38094,40893,2003-02-17,4,27,great with a salad. cooked on top of stove for...
1,1293707,40893,2011-12-21,5,31,"so simple, so delicious! great for chilly fall..."
2,8937,44394,2002-12-01,4,19,this worked very well and is easy. i used not...
3,126440,85009,2010-02-27,5,13,i made the mexican topping and took it to bunk...
4,57222,85009,2011-10-01,5,12,"made the cheddar bacon topping, adding a sprin..."


In [62]:
# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=600, stop_words='english')

# Fit the vectorizer and transform the data in one call
# (equivalent to calling .fit() and then .transform())
tv_transformed = tv.fit_transform(interactions['review_clean'])

# Create a DataFrame with these features
tv_reviews = pd.DataFrame(tv_transformed.toarray(), 
                     columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(tv_reviews.head(2))

   TFIDF_039  TFIDF_10  TFIDF_12  TFIDF_15  TFIDF_20  TFIDF_30  TFIDF_able  \
0        0.0       0.0       0.0  0.353267       0.0       0.0         0.0   
1        0.0       0.0       0.0  0.000000       0.0       0.0         0.0   

   TFIDF_absolutely  TFIDF_actually  TFIDF_add  ...  TFIDF_written  \
0               0.0             0.0        0.0  ...            0.0   
1               0.0             0.0        0.0  ...            0.0   

   TFIDF_wrong  TFIDF_year  TFIDF_years  TFIDF_yellow  TFIDF_yogurt  \
0          0.0         0.0          0.0           0.0           0.0   
1          0.0         0.0          0.0           0.0           0.0   

   TFIDF_yum  TFIDF_yummy  TFIDF_zaar  TFIDF_zucchini  
0        0.0          0.0         0.0             0.0  
1        0.0          0.0         0.0             0.0  

[2 rows x 600 columns]


In [63]:
review_word_frequency = tv_reviews.sum().sort_values(ascending=False)

In [64]:
review_word_frequency

TFIDF_recipe     68964.561556
TFIDF_thanks     47138.922346
TFIDF_used       46582.038343
TFIDF_good       46566.897430
TFIDF_great      45923.824496
                     ...     
TFIDF_step        2121.489910
TFIDF_started     2030.695626
TFIDF_lt          2008.821189
TFIDF_foil        2005.044858
TFIDF_gt          1987.406286
Length: 600, dtype: float64
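
Tokens such as `039`, `lt`, and `gt` in the feature list look like leftovers of HTML entities (`&#039;`, `&lt;`, `&gt;`) in the review text. That is a guess from the token names, not something the notebook confirms. If it holds, unescaping the text before vectorizing would remove them. A minimal sketch on hypothetical review strings:

```python
import html

import pandas as pd

# Hypothetical reviews containing HTML entities
reviews = pd.Series(["i didn&#039;t have foil", "bake &lt; 30 min &gt; done"])

# html.unescape converts entities back to the characters they encode
unescaped = reviews.map(html.unescape)
print(unescaped.tolist())
```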

## Scaling

We don't need to scale the data for collaborative filtering.

In [41]:
# Making a Scaler object
# scaler = preprocessing.StandardScaler()
# Fitting data to the scaler object
# scaled_df = scaler.fit_transform(df)
# scaled_df = pd.DataFrame(scaled_df, columns=names)

## Split Train & Test Datasets

We don't need to split the data for collaborative filtering.

In [42]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Prepare a cohesive dataset for collaborative filtering
This essentially needs only user IDs, recipe IDs, and the users' ratings of those recipes.

In [76]:
interactions.columns

Index(['user_id', 'recipe_id', 'date', 'rating', 'word_count', 'review_clean'], dtype='object')

In [77]:
cf = interactions.drop(columns=['date', 'word_count', 'review_clean'])

In [78]:
cf.head()

Unnamed: 0,user_id,recipe_id,rating
0,38094,40893,4
1,1293707,40893,5
2,8937,44394,4
3,126440,85009,5
4,57222,85009,5


In [96]:
cf.to_csv('user_collaborative_filtering.csv')
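
Many collaborative filtering methods work on a user-by-item rating matrix rather than on the long `(user_id, recipe_id, rating)` table. The sketch below shows one way such a matrix could be built as a sparse array. It uses the five rows printed above as a stand-in for `cf`; the names `cf_demo`, `user_index`, and `recipe_index` are illustrative, and whether a later modeling step needs this form is an assumption.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Stand-in for cf, using the five rows shown by cf.head()
cf_demo = pd.DataFrame({
    "user_id": [38094, 1293707, 8937, 126440, 57222],
    "recipe_id": [40893, 40893, 44394, 85009, 85009],
    "rating": [4, 5, 4, 5, 5],
})

# Map raw IDs to consecutive row/column positions
user_codes, user_index = pd.factorize(cf_demo["user_id"])
recipe_codes, recipe_index = pd.factorize(cf_demo["recipe_id"])

# Rows are users, columns are recipes, values are ratings
ratings = csr_matrix(
    (cf_demo["rating"].to_numpy(), (user_codes, recipe_codes)),
    shape=(len(user_index), len(recipe_index)),
)
print(ratings.shape, ratings.nnz)
```

`user_index` and `recipe_index` keep the original IDs so that rows and columns can be mapped back. Note that if the same user rated the same recipe more than once, this construction sums the duplicate entries, so duplicates would need to be resolved first.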