# Data Scientist Professional Practical Exam Submission

The objective of this project is to assist the product team by predicting which recipes are likely to generate high traffic when featured on the homepage of Tasty Bytes. This capability could significantly increase site traffic and, consequently, subscriptions.

We'll start by validating the data, including removing any duplicates, filling in missing values, and correcting data types of the columns in our dataframe. Next, we'll explore the data to gather insights, which will guide us in developing an appropriate machine learning model. After building the model, we'll evaluate its effectiveness. Finally, we will relate our findings to business metrics and offer recommendations to enhance business outcomes.

### Data Validation

In [13]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import boxcox, yeojohnson
from scipy.stats.mstats import winsorize
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

# Load the CSV file
recipe_site_traffic = pd.read_csv("recipe_site_traffic_2212.csv")

# Preview the first 10 rows
recipe_site_traffic.head(10)

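| | recipe | calories | carbohydrate | sugar | protein | category | servings | high_traffic |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | NaN | NaN | Pork | 6 | High |
| 1 | 2 | 35.48 | 38.56 | 0.66 | 0.92 | Potato | 4 | High |
| 2 | 3 | 914.28 | 42.68 | 3.09 | 2.88 | Breakfast | 1 | NaN |
| 3 | 4 | 97.03 | 30.56 | 38.63 | 0.02 | Beverages | 4 | High |
| 4 | 5 | 27.05 | 1.85 | 0.8 | 0.53 | Beverages | 4 | NaN |
| 5 | 6 | 691.15 | 3.46 | 1.65 | 53.93 | One Dish Meal | 2 | High |
| 6 | 7 | 183.94 | 47.95 | 9.75 | 46.71 | Chicken Breast | 4 | NaN |
| 7 | 8 | 299.14 | 3.17 | 0.4 | 32.4 | Lunch/Snacks | 4 | NaN |
| 8 | 9 | 538.52 | 3.78 | 3.37 | 3.79 | Pork | 6 | High |
| 9 | 10 | 248.28 | 48.54 | 3.99 | 113.85 | Chicken | 2 | NaN |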


In [14]:
# Check if there are duplicates
recipe_site_traffic.duplicated(subset='recipe').sum()

np.int64(0)

Since there are no duplicated recipes, no rows need to be removed. Next, we'll examine the dataset's structure: the number of rows and columns, the column names and data types, and the non-null count in each column. If any column needs a data type conversion or contains missing values, we'll correct it.

In [15]:
# Inspect the column data types and non-null counts
recipe_site_traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 947 entries, 0 to 946
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   recipe        947 non-null    int64  
 1   calories      895 non-null    float64
 2   carbohydrate  895 non-null    float64
 3   sugar         895 non-null    float64
 4   protein       895 non-null    float64
 5   category      947 non-null    object 
 6   servings      947 non-null    object 
 7   high_traffic  574 non-null    object 
dtypes: float64(4), int64(1), object(3)
memory usage: 59.3+ KB
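
The `info()` output points to three things to address: `servings` is stored as `object` rather than as a number, the four nutritional columns each have 52 missing values (947 - 895), and `high_traffic` has 373 missing values (947 - 574). A minimal sketch of how such gaps could be tabulated per column follows. The `toy` frame and the `missing_summary` name are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recipe_site_traffic (illustrative values, not the real data)
toy = pd.DataFrame({
    "recipe": [1, 2, 3, 4],
    "calories": [np.nan, 35.48, 914.28, 97.03],
    "servings": ["6", "4", "1", "4 as a snack"],
    "high_traffic": ["High", "High", np.nan, "High"],
})

# Count and percentage of missing values per column, next to each dtype
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
    "dtype": toy.dtypes.astype(str),
})
print(missing_summary)
```

Run against the real dataframe, the same summary would be expected to reproduce the gaps visible in the `info()` output above.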


In [16]:
# Check the values of 'servings'
recipe_site_traffic['servings'].value_counts()

servings
4               389
6               197
2               183
1               175
4 as a snack      2
6 as a snack      1
Name: count, dtype: int64

In [17]:
# Strip the " as a snack" suffix so only the numeric serving count remains
recipe_site_traffic['servings'] = recipe_site_traffic['servings'].str.replace(" as a snack", "")

# Check the values again
recipe_site_traffic['servings'].value_counts()

servings
4    391
6    198
2    183
1    175
Name: count, dtype: int64
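
The string replacement works here because the only non-numeric suffix in the data is " as a snack". As a hedged alternative, the leading number could be extracted with a regular expression, which would also cope with other suffixes. The sketch below uses toy values; `servings_raw`, `servings_num`, and `servings_clean` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy values mirroring the formats seen in the 'servings' column
servings_raw = pd.Series(["4", "6", "2", "1", "4 as a snack", "6 as a snack"])

# Extract the leading run of digits; expand=False returns a Series
servings_num = servings_raw.str.extract(r"^(\d+)", expand=False)

# Fail loudly if any value did not start with a number
if servings_num.isna().any():
    raise ValueError("Some servings values contain no leading number")

# Convert the extracted digit strings to integers
servings_clean = servings_num.astype(int)
print(servings_clean.tolist())
```

The explicit check is a design choice: a value with no leading digits would otherwise become a missing value without any warning.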

In [18]:
# Convert the servings column to integer
recipe_site_traffic['servings'] = recipe_site_traffic['servings'].astype('int')

# Check the values of the high_traffic column
recipe_site_traffic['high_traffic'].value_counts()

high_traffic
High    574
Name: count, dtype: int64

In [19]:
# Encode the target as boolean: "High" becomes True, missing values become False
recipe_site_traffic['high_traffic'] = recipe_site_traffic['high_traffic'] == "High"

# Check the values of the high_traffic column again
recipe_site_traffic['high_traffic'].value_counts()

high_traffic
True     574
False    373
Name: count, dtype: int64
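
Encoding missing values as `False` assumes that "High" is the only label in the column, which the earlier `value_counts()` output confirms for this dataset. The result is a moderately imbalanced target: 574 of 947 recipes (about 60.6%) are high traffic. The sketch below shows one way that assumption could be guarded and the class shares computed. It uses toy values, and `high_traffic_raw`, `target`, and `class_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw high_traffic column (illustrative values only)
high_traffic_raw = pd.Series(["High", "High", np.nan, "High", np.nan])

# Guard: every non-null label must be "High"; any other label would
# silently be encoded as False
unexpected = set(high_traffic_raw.dropna().unique()) - {"High"}
if unexpected:
    raise ValueError(f"Unexpected labels: {unexpected}")

# Encode the target; a missing value compared with "High" yields False
target = high_traffic_raw == "High"

# Share of each class, useful later when judging model metrics
class_share = target.value_counts(normalize=True)
print(class_share)
```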

In [20]:
# Check the values of the category column
recipe_site_traffic['category'].value_counts()

category
Breakfast         106
Chicken Breast     98
Beverages          92
Lunch/Snacks       89
Potato             88
Pork               84
Vegetable          83
Dessert            83
Meat               79
Chicken            74
One Dish Meal      71
Name: count, dtype: int64
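
The counts show 11 distinct category labels. Before modeling, it can be worth confirming that none of them differ only by stray whitespace or letter case, and then storing the column as a categorical type. The following is a sketch under those assumptions, using toy labels; `category_raw`, `normalized`, `has_variants`, and `category_clean` are hypothetical names.

```python
import pandas as pd

# Toy labels taken from the category names shown above
category_raw = pd.Series(
    ["Pork", "Potato", "Breakfast", "Chicken Breast", "Chicken", "Pork"]
)

# Normalize whitespace and case; if this reduces the number of distinct
# labels, some labels differ only by formatting
normalized = category_raw.str.strip().str.lower()
has_variants = normalized.nunique() != category_raw.nunique()
print("Formatting variants present:", has_variants)

# Store the column as a categorical type
category_clean = category_raw.astype("category")
print(list(category_clean.cat.categories))
```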