<a href="https://colab.research.google.com/github/sjoseph25/data_2000/blob/main/midterm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA-2000 Midterm Exam

## Recipe Rating Prediction

For this exercise, we are going to use a dataset of recipes and their ratings, taken from [the website Epicurious](https://www.epicurious.com/recipes-menus).

Our dataset contains basic information about the dish (its name, description, ingredients, and directions), as well as nutritional content (calories, protein, sodium, and fat contents). Based on this information, we want to try and predict how well or poorly the dish will be rated by users.


## Grading Rubric

This midterm will be worth 15% of your total grade for this course. It will be graded out of 50 points, divided into 4 sections:

  - Data Prep: 10 points
    - 5 points will be awarded for the actual data cleaning (evaluating your Python code)
    - 5 points will be awarded for the text commentary narrating your choices and explaining your rationale for the data quality checks that you chose to use
  - Feature Engineering: 12 points
    - 2 points will be awarded by default, but may be subtracted from if there are substantial errors in your data prep that reduce the quality of your engineered features
    - 5 points will be awarded for the actual feature engineering (evaluating your Python code)
    - 5 points will be awarded for the text commentary narrating your choices and explaining your rationale
  - Model Building: 14 points
    - 4 points will be awarded by default, but may be subtracted from if there are substantial errors in your feature engineering that reduce the quality of your model
    - 5 points will be awarded for the actual model building (evaluating your Python code)
    - 5 points will be awarded for the text commentary narrating your choices and explaining your rationale
  - Model Validation/Evaluation: 14 points
    - 4 points will be awarded by default, but may be subtracted from if there are substantial errors in your model building that negatively impact the validity of your model
    - 5 points will be awarded for the actual model validation and evaluation (evaluating your Python code)
    - 5 points will be awarded for the text commentary narrating your choices and explaining your rationale

> **NOTE:** You will NOT be evaluated on whether your model actually makes accurate predictions.


## Using Additional Resources

This is an open-resource exam. You may use any available resources as references. I will be available for any questions that you have during the exam.

Remember that all work must still be your own, and that this exam is governed by the [Policy on Academic Honesty outlined in our course syllabus](https://docs.google.com/document/d/1Aoh7LvTKTEZO74eOsNhLzorkLtljkuchpg3ScNM_VEs/edit#heading=h.r0b18a8gh450).

-----

## Importing the Data

First, let's download our dataset and take a look at what it contains:

In [150]:
import numpy as np

In [151]:
import pandas as pd

data = pd.read_json('https://cdn.c18l.org/full_format_recipes.json')

In [152]:
data.head()

Unnamed: 0,directions,fat,date,categories,calories,desc,protein,rating,title,ingredients,sodium
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,2006-09-01 04:00:00+00:00,"[Sandwich, Bean, Fruit, Tomato, turkey, Vegeta...",426.0,,30.0,2.5,"Lentil, Apple, and Turkey Wrap","[4 cups low-sodium vegetable or chicken stock,...",559.0
1,[Combine first 9 ingredients in heavy medium s...,23.0,2004-08-20 04:00:00+00:00,"[Food Processor, Onion, Pork, Bake, Bastille D...",403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
2,[In a large heavy saucepan cook diced fennel a...,7.0,2004-08-20 04:00:00+00:00,"[Soup/Stew, Dairy, Potato, Vegetable, Fennel, ...",165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
3,[Heat oil in heavy large skillet over medium-h...,,2009-03-27 04:00:00+00:00,"[Fish, Olive, Tomato, Sauté, Low Fat, Low Cal,...",,The Sicilian-style tomato sauce has tons of Me...,,5.0,Mahi-Mahi in Tomato Olive Sauce,"[2 tablespoons extra-virgin olive oil, 1 cup c...",
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,2004-08-20 04:00:00+00:00,"[Cheese, Dairy, Pasta, Vegetable, Side, Bake, ...",547.0,,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0


## Data Prep & Cleaning

Perform any data quality checks and data cleaning that you believe is appropriate. Convert any categorical columns to numeric ones, if needed. Provide a narrative explanation of your choices to accompany any code.

In [153]:
data.describe()

| | fat | calories | protein | rating | sodium |
|---|---|---|---|---|---|
| count | 15908.0 | 15976.0 | 15929.0 | 20100.0 | 15974.0 |
| mean | 346.0975 | 6307.857 | 99.946199 | 3.71306 | 6211.474 |
| std | 20431.02 | 358585.1 | 3835.616663 | 1.343144 | 332890.3 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7.0 | 198.0 | 3.0 | 3.75 | 80.0 |
| 50% | 17.0 | 331.0 | 8.0 | 4.375 | 294.0 |
| 75% | 33.0 | 586.0 | 27.0 | 4.375 | 711.0 |
| max | 1722763.0 | 30111220.0 | 236489.0 | 5.0 | 27675110.0 |


In [154]:
data.columns

Index(['directions', 'fat', 'date', 'categories', 'calories', 'desc',
       'protein', 'rating', 'title', 'ingredients', 'sodium'],
      dtype='object')

Based on the `describe()` output, the column list, and the data itself, I am going to clean up the dataframe in two ways. First, I am going to drop the `date` column because I do not foresee it being useful to me. Second, I am going to limit the data to rows that fall at or under the 75th-percentile cutoff for the `fat`, `protein`, `sodium`, and `calories` columns. I am choosing to do this because the max values for those columns are (to me) unbelievably high: 236,489 g of protein is a lot. Because comparisons against missing values evaluate to false, these filters also drop any row that is missing one of the four nutrient values.

In [155]:
data = data.drop(columns = ['date'])

data = data.loc[data['fat'] <= 33.0]

data = data.loc[data['calories'] <= 586]

data = data.loc[data['protein'] <= 27.0]

data = data.loc[data['sodium'] <= 711.0]
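
The four cutoffs above were typed in by hand from the `describe()` output. As a minimal sketch, the same kind of cutoff could be derived with `DataFrame.quantile` instead. The toy frame `toy` and the helper name `filter_below_quantile` are made up for illustration, and the 0.75 threshold simply mirrors the choice described above.

```python
import numpy as np
import pandas as pd


def filter_below_quantile(df: pd.DataFrame, columns: list, q: float = 0.75) -> pd.DataFrame:
    """Keep rows whose values in `columns` are all at or below the q-th quantile."""
    # One cutoff per column; quantile() ignores missing values.
    cutoffs = df[columns].quantile(q)
    # A comparison against NaN is False, so rows with missing values are dropped too.
    keep = (df[columns] <= cutoffs).all(axis=1)
    return df.loc[keep]


# Toy data (illustrative only): one row with missing values, one extreme outlier.
toy = pd.DataFrame({
    'fat':      [7.0, 23.0, 7.0, np.nan, 32.0, 1722763.0],
    'calories': [426.0, 403.0, 165.0, np.nan, 547.0, 30111220.0],
})

filtered = filter_below_quantile(toy, ['fat', 'calories'])
print(filtered)
```

Computing the cutoffs once on the unfiltered frame keeps them tied to the original distribution, which matches how the hand-typed values were obtained.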

In [156]:
data.head()

Unnamed: 0,directions,fat,categories,calories,desc,protein,rating,title,ingredients,sodium
2,[In a large heavy saucepan cook diced fennel a...,7.0,"[Soup/Stew, Dairy, Potato, Vegetable, Fennel, ...",165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,"[Cheese, Dairy, Pasta, Vegetable, Side, Bake, ...",547.0,,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0
10,[Heat oil in heavy large skillet over medium-h...,5.0,"[Milk/Cream, Dairy, Side, Thanksgiving, Rosema...",256.0,Simmering the yams fills them with flavor and ...,4.0,3.75,"Yams Braised with Cream, Rosemary and Nutmeg","[4 teaspoons olive oil, 1/2 cup finely chopped...",30.0
13,[Sprinkle steaks with salt and pepper. Heat oi...,12.0,"[Garlic, Sauté, Low Carb, Quick & Easy, Wheat/...",174.0,This recipe can be prepared in 45 minutes or l...,11.0,4.375,Beef Tenderloin with Garlic and Brandy,[4 6- to 7-ounce beef tenderloin steaks (each ...,176.0
16,[Butter and sugar six 2/3-to 3/4-cup ramekins....,5.0,"[Bread, Milk/Cream, Breakfast, Brunch, Dessert...",146.0,Classic spoon bread is a savory pudding served...,4.0,1.875,Sweet Buttermilk Spoon Breads,"[1 cup water, 2/3 cup buttermilk, 1/3 cup heav...",160.0


In [157]:
data.describe()

| | fat | calories | protein | rating | sodium |
|---|---|---|---|---|---|
| count | 9010.0 | 9010.0 | 9010.0 | 9007.0 | 9010.0 |
| mean | 11.347614 | 238.458047 | 5.617647 | 3.630163 | 188.667259 |
| std | 8.763359 | 127.454756 | 5.471992 | 1.432727 | 187.79506 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4.0 | 146.0 | 2.0 | 3.75 | 30.0 |
| 50% | 11.0 | 229.0 | 4.0 | 4.375 | 123.0 |
| 75% | 18.0 | 313.75 | 8.0 | 4.375 | 303.0 |
| max | 33.0 | 586.0 | 27.0 | 5.0 | 711.0 |


After looking at the `data.describe()` output from my first round of cleaning, I noticed that the `rating` count is 3 lower than the counts for `fat`, `calories`, `protein`, and `sodium`. I am going to drop the rows with missing ratings, since a row without a target value cannot be used to train or evaluate the model.

In [158]:
data = data.dropna(subset = ['rating'])
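
Rather than comparing counts in the `describe()` output, missing values can also be counted directly. The following is a small sketch on a made-up frame (`toy` is not the recipe data):

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): one rating is missing.
toy = pd.DataFrame({
    'rating': [3.75, np.nan, 4.375, 5.0],
    'fat':    [7.0, 12.0, 5.0, 32.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows where the target column is missing.
cleaned = toy.dropna(subset=['rating'])
```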

In [159]:
data.describe()

| | fat | calories | protein | rating | sodium |
|---|---|---|---|---|---|
| count | 9007.0 | 9007.0 | 9007.0 | 9007.0 | 9007.0 |
| mean | 11.347507 | 238.456756 | 5.618186 | 3.630163 | 188.686355 |
| std | 8.76352 | 127.463523 | 5.472804 | 1.432727 | 187.814938 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4.0 | 146.0 | 2.0 | 3.75 | 30.0 |
| 50% | 11.0 | 229.0 | 4.0 | 4.375 | 123.0 |
| 75% | 18.0 | 313.5 | 8.0 | 4.375 | 303.0 |
| max | 33.0 | 586.0 | 27.0 | 5.0 | 711.0 |


## Feature Engineering

Develop any new feature(s) that you feel may be relevant to a model. Provide a narrative explanation of your choices to accompany any code.

To help, I've included a `column_builder()` utility function that will create a new boolean column based on whether a string of text appears in any of (1) the recipe title; (2) the recipe description; or (3) the recipe tags.

In [160]:
# def column_builder(category: str, dataset: pd.DataFrame) -> pd.DataFrame:
#     dataset[f'is_{category}'] = ((
#         dataset['categories'].apply(
#             lambda tags: isinstance(tags, list)
#             and any(category.lower() in str(tag).lower() for tag in tags)
#         )
#     ) | (
#         dataset['title'].str.contains(f'{category}', na=False, case=False)
#     ) | (
#         dataset['desc'].str.contains(f'{category}', na=False, case=False)
#     )).astype(int)

#     return dataset


# categories = [
#     'easy',
#     'breakfast', 'lunch', 'dinner',
#     'vegetarian', 'vegan',
# ]

# for category in categories:
#     data = column_builder(category, data)

# data['is_easy'].describe()

I added a few keywords to the categories list in the given code above (lunch, dinner, vegetarian, vegan). I did not want to go overboard and overcrowd the dataset with too many, so I stuck to a few big ones. **Note:** I ended up not using these features later on, which is why the code is commented out.
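
For reference, here is a hedged sketch of how a keyword-flag builder could run against list-valued tags. It is a variation on the utility above, not the course's official version: it assumes `categories` holds a Python list of tag strings per recipe (as the `head()` output suggests), and the toy recipes are invented for illustration.

```python
import pandas as pd


def column_builder(category: str, dataset: pd.DataFrame) -> pd.DataFrame:
    """Add an integer is_<category> column (1 if the keyword appears, else 0)."""
    needle = category.lower()

    # `categories` is assumed to hold a list of tags per recipe, so check each tag;
    # anything that is not a list (e.g. a missing value) counts as no match.
    in_tags = dataset['categories'].apply(
        lambda tags: isinstance(tags, list)
        and any(needle in str(tag).lower() for tag in tags)
    )
    # Plain, case-insensitive substring search in the two text columns.
    in_title = dataset['title'].str.contains(category, na=False, case=False, regex=False)
    in_desc = dataset['desc'].str.contains(category, na=False, case=False, regex=False)

    dataset[f'is_{category}'] = (in_tags | in_title | in_desc).astype(int)
    return dataset


# Toy data (illustrative only).
toy = pd.DataFrame({
    'title': ['Spinach Noodle Casserole', 'Beef Tenderloin', 'Vegan Chili'],
    'desc': [None, 'An easy weeknight dinner.', None],
    'categories': [['Vegetarian', 'Bake'], ['Quick & Easy'], None],
})

for category in ['easy', 'vegetarian', 'vegan']:
    toy = column_builder(category, toy)

print(toy[['is_easy', 'is_vegetarian', 'is_vegan']])
```

Substring matching is deliberately loose here, so a keyword such as "easy" also matches a tag like "Quick & Easy".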

Other engineered features I am interested in are nutrient ratios. I will be creating ratios of fat to sodium, fat to protein, and protein to sodium. It crossed my mind to take the ratio of fat, sodium, and protein to calories, but I am more interested in seeing how calories as a whole affect rating, alongside the nutrient ratios already mentioned.

In [161]:
# Nutrient ratios; a zero denominator produces inf (or NaN for 0/0).
data['fat/sodium'] = data['fat'] / data['sodium']
data['fat/protein'] = data['fat'] / data['protein']
data['protein/sodium'] = data['protein'] / data['sodium']

In [162]:
data.head()

Unnamed: 0,directions,fat,categories,calories,desc,protein,rating,title,ingredients,sodium,fat/sodium,fat/protein,protein/sodium
2,[In a large heavy saucepan cook diced fennel a...,7.0,"[Soup/Stew, Dairy, Potato, Vegetable, Fennel, ...",165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0,0.042424,1.166667,0.036364
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,"[Cheese, Dairy, Pasta, Vegetable, Side, Bake, ...",547.0,,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0,0.070796,1.6,0.044248
10,[Heat oil in heavy large skillet over medium-h...,5.0,"[Milk/Cream, Dairy, Side, Thanksgiving, Rosema...",256.0,Simmering the yams fills them with flavor and ...,4.0,3.75,"Yams Braised with Cream, Rosemary and Nutmeg","[4 teaspoons olive oil, 1/2 cup finely chopped...",30.0,0.166667,1.25,0.133333
13,[Sprinkle steaks with salt and pepper. Heat oi...,12.0,"[Garlic, Sauté, Low Carb, Quick & Easy, Wheat/...",174.0,This recipe can be prepared in 45 minutes or l...,11.0,4.375,Beef Tenderloin with Garlic and Brandy,[4 6- to 7-ounce beef tenderloin steaks (each ...,176.0,0.068182,1.090909,0.0625
16,[Butter and sugar six 2/3-to 3/4-cup ramekins....,5.0,"[Bread, Milk/Cream, Breakfast, Brunch, Dessert...",146.0,Classic spoon bread is a savory pudding served...,4.0,1.875,Sweet Buttermilk Spoon Breads,"[1 cup water, 2/3 cup buttermilk, 1/3 cup heav...",160.0,0.03125,1.25,0.025


In [163]:
data.describe()

| | fat | calories | protein | rating | sodium | fat/sodium | fat/protein | protein/sodium |
|---|---|---|---|---|---|---|---|---|
| count | 9007.0 | 9007.0 | 9007.0 | 9007.0 | 9007.0 | 8977.0 | 8477.0 | 8970.0 |
| mean | 11.347507 | 238.456756 | 5.618186 | 3.630163 | 188.686355 | inf | inf | inf |
| std | 8.76352 | 127.463523 | 5.472804 | 1.432727 | 187.814938 | NaN | NaN | NaN |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4.0 | 146.0 | 2.0 | 3.75 | 30.0 | 0.023891 | 1.0 | 0.015873 |
| 50% | 11.0 | 229.0 | 4.0 | 4.375 | 123.0 | 0.061033 | 2.0 | 0.032434 |
| 75% | 18.0 | 313.5 | 8.0 | 4.375 | 303.0 | 0.151515 | 4.0 | 0.076923 |
| max | 33.0 | 586.0 | 27.0 | 5.0 | 711.0 | inf | inf | inf |
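
The `inf` entries in the ratio columns come from dividing by zero, and the lower counts come from 0/0, which yields NaN. One possible way to avoid `inf` at the source is to turn zero denominators into NaN before dividing. This is only a sketch: `safe_ratio` is a hypothetical helper name and `toy` is made-up data.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, yielding NaN (not inf) where the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)


# Toy data (illustrative only): two rows have zero sodium.
toy = pd.DataFrame({
    'fat':    [7.0, 5.0, 0.0],
    'sodium': [165.0, 0.0, 0.0],
})
toy['fat/sodium'] = safe_ratio(toy['fat'], toy['sodium'])
print(toy)
```

With this approach every undefined ratio ends up as NaN, so a single missing-value step can handle them later.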


## Model Building

Build a model (either a regression or a neural network) to predict a recipe's rating based on any relevant attributes that you defined in the prior steps.

You may choose to predict rating as a continuous value (0.0 to 5.0), or as a categorical (low/medium/high or similar).

Provide a narrative explanation of your choices to accompany any code.

I chose to fit linear regressions. I prefer linear regression because I understand its theory best out of the models we have looked at, and I am more comfortable coding it than logistic regressions or neural networks.

In [164]:
# Change infinite values to NaN.
# Code modeled on this thread: https://stackoverflow.com/questions/17477979/dropping-infinite-values-from-dataframes-in-pandas
data.replace([np.inf, -np.inf], np.nan, inplace=True)

In [165]:
# Drop every row that still has a missing value in any column (this includes `desc`).
# dropna(inplace=True) modifies `data` directly and returns None, so there is nothing to assign.
data.dropna(inplace=True)
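
Calling `dropna()` with no `subset` removes a row if any column is missing, including `desc`, which is empty for many recipes in the `head()` output. A sketch of restricting the drop to the columns the model actually uses is shown here; `toy` and `model_columns` are illustrative names, not part of the exam code.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): two rows lack a description, one has an infinite ratio.
toy = pd.DataFrame({
    'desc':       [None, 'A savory pudding.', None],
    'fat':        [7.0, 5.0, 12.0],
    'fat/sodium': [0.042424, np.inf, 0.068182],
    'rating':     [3.75, 1.875, 4.375],
})

model_columns = ['fat', 'fat/sodium', 'rating']

# Treat infinite ratios as missing, then drop rows only when a model column is missing.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna(subset=model_columns)

# For comparison: a bare dropna() also discards rows whose only gap is `desc`.
dropped_everything = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(len(cleaned), len(dropped_everything))
```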

In [166]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# NOTE: train_size=0.2 fits on 20% of the rows and holds out the other 80%.
train_data, test_data = train_test_split(data, train_size=0.2, random_state=42)
model = LinearRegression().fit(
    X = train_data.loc[:, [
        'fat', 'calories', 'protein',
        'sodium', 'fat/sodium',
        'fat/protein', 'protein/sodium']],
    y = train_data['rating']
)

In [167]:
# Same split as the previous cell (identical random_state), repeated so this cell runs on its own.
train_data, test_data = train_test_split(data, train_size=0.2, random_state=42)
model2 = LinearRegression().fit(
    X = train_data.loc[:, [
        'fat', 'calories', 'protein',
        'sodium']],
    y = train_data['rating']
)

## Model Evaluation

After training your model, evaluate its performance. What metric(s) did you choose to optimize on? Would you say that your model performed well or poorly? How did you evaluate its performance to arrive at that conclusion?

In [168]:
model.score(
    X = train_data.loc[:, [
        'fat', 'calories', 'protein',
        'sodium', 'fat/sodium',
        'fat/protein', 'protein/sodium']],
    y = train_data['rating']
)

0.014169201251647334

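The above linear regression performs extremely poorly: `score()` returns the coefficient of determination (R²), which is about 0.014 here, so the model explains roughly 1.4% of the variance in rating. Note that this score is computed on the training data, so it does not yet say how the model does on held-out rows. The features for this regression are the nutrients and their ratios.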

In [169]:
model2.score(
    X = train_data.loc[:, [
        'fat', 'calories', 'protein',
        'sodium']],
    y = train_data['rating']
)

0.010166249395689664

This linear regression also performs poorly: its R² on the training data is about 0.010 (1.02%) using only the nutrient and calorie measurements.
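
Both scores above are computed on `train_data`, the same rows the models were fitted on. A minimal sketch of scoring on held-out rows, with a mean-only baseline for comparison, might look like the following. The data is synthetic (not the recipe dataset), and the choice of mean absolute error as a second metric is an assumption for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features and ratings (assumption, not the recipe data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
# Baseline that always predicts the mean of the training target.
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)

test_r2 = r2_score(y_test, model.predict(X_test))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
baseline_r2 = r2_score(y_test, baseline.predict(X_test))

print(f'test R^2: {test_r2:.3f}, test MAE: {test_mae:.3f}, baseline R^2: {baseline_r2:.3f}')
```

A model whose held-out R² is close to the baseline's is doing little better than predicting the average rating for every recipe.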

Implications of the two linear regressions so far: the nutrient and calorie measurements, as well as the nutrient ratios, are probably not great predictors of dish ratings. If I were to work with the dataset further, I would be interested to see whether ingredients play a part in the ratings. That would require going through the lists in the `ingredients` column in a way I am not quite sure how to do, which is why I am not attempting it here.
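
As a rough sketch of one way to go through a list-valued column, `Series.apply` can run a small function on each recipe's list. The example assumes each `ingredients` entry is a Python list of strings; the toy recipes, the helper `contains_ingredient`, and the column names `n_ingredients` and `has_butter` are all invented for illustration.

```python
import pandas as pd


def contains_ingredient(ingredients, keyword: str) -> int:
    """Return 1 if any ingredient line mentions `keyword`, else 0."""
    if not isinstance(ingredients, list):
        return 0
    return int(any(keyword.lower() in str(line).lower() for line in ingredients))


# Toy data (illustrative only).
toy = pd.DataFrame({
    'title': ['Toy recipe A', 'Toy recipe B'],
    'ingredients': [
        ['1 cup water', '2 tablespoons butter'],
        ['3 eggs', '1 cup milk', '1 teaspoon salt'],
    ],
})

# Number of ingredient lines per recipe (0 when the entry is not a list).
toy['n_ingredients'] = toy['ingredients'].apply(
    lambda items: len(items) if isinstance(items, list) else 0
)
# Flag recipes that mention a given ingredient.
toy['has_butter'] = toy['ingredients'].apply(contains_ingredient, keyword='butter')

print(toy[['title', 'n_ingredients', 'has_butter']])
```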

-----

# Midterm Submission

To submit this exam, in Canvas navigate to DATA-2000-51 > Assignments > Midterm Exam ([link](https://canvas.jcu.edu/courses/33514/assignments/407120)). You can either upload the `.ipynb` file directly to Canvas, or you can provide a link to the assignment on your GitHub.