# Data Scientist Professional Practical Exam Submission

**Use this template to write up your summary for submission. Code in Python or R needs to be included.**


## 📝 Task List

Your written report should include code, output, and written text summaries of the following:
- Data Validation:   
  - Describe validation and cleaning steps for every column in the data 
- Exploratory Analysis:  
  - Include two different graphics showing single variables only to demonstrate the characteristics of data  
  - Include at least one graphic showing two or more variables to represent the relationship between features
  - Describe your findings
- Model Development
  - Include your reasons for selecting the models you use as well as a statement of the problem type
  - Code to fit the baseline and comparison models
- Model Evaluation
  - Describe the performance of the two models based on an appropriate metric
- Business Metrics
  - Define a way to compare your model performance to the business
  - Describe how your models perform using this approach
- Final summary including recommendations that the business should undertake

*Start writing report here..*

## Initial Setup

In [2]:
import sys
sys.version

'3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]'

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, ConfusionMatrixDisplay, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.tree import DecisionTreeClassifier


## Data Ingestion

In [None]:
df = pd.read_csv('recipe_site_traffic_2212.csv', index_col='recipe')

## Data Validation

In [None]:
# Round the nutritional-fact columns to whole numbers
cols = ['calories', 'carbohydrate', 'sugar', 'protein']
df[cols] = df[cols].round()

# Wrangle 'servings' column. Extract only integers.
df['servings'] = df['servings'].str[:1].astype(int)

# Recode 'high_traffic' as boolean: True where a value is present, False where it is missing
df['high_traffic'] = df['high_traffic'].notna()

# Wrangle 'category' column
df['category'] = df['category'].str.replace(' Breast', '').astype('category')

# Add column to indicate if all the nutritional facts are displayed on the website
# (True only when none of the nutritional columns is missing)
count_nan_per_row = df[cols].isna().sum(axis=1)
df['nutritional_facts_label'] = count_nan_per_row == 0
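
The `servings` step above keeps only the first character, which works as long as every value starts with a single digit. A slightly more defensive sketch, assuming the values look like `'4'` or `'6 as a snack'` (the sample values below are made up for illustration and are not taken from the dataset), extracts the leading run of digits instead:

```python
import pandas as pd

# Hypothetical sample values; the real column may contain other formats.
servings = pd.Series(['4', '6 as a snack', '12', '1'])

# Capture the leading run of digits so multi-digit values are not truncated.
servings_clean = servings.str.extract(r'^(\d+)', expand=False).astype(int)

print(servings_clean.tolist())
```

With `.str[:1]`, a value such as `'12'` would be read as `1`, so the regex version is the safer choice if larger serving counts can occur.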

## Exploratory Analysis

  - Include two different graphics showing single variables only to demonstrate the characteristics of data  
  - Include at least one graphic showing two or more variables to represent the relationship between features
  - Describe your findings

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().round(2)

### Target Variable

#### High_traffic

In [None]:
n = df['high_traffic'].value_counts().rename('n')
p = df['high_traffic'].value_counts(normalize=True).rename('p')

table = pd.concat([n, p], axis=1).rename_axis('high_traffic').reset_index()
table

In [None]:
sns.barplot(data=table, x='high_traffic', y='p')

## Feature Variables

### Calories

In [None]:
feature = 'calories'

x = df.groupby('high_traffic')[feature].mean().round(2).reset_index()
x

In [None]:
sns.barplot(data=df, x='high_traffic', y=feature);

In [None]:
sns.histplot(df[feature]);

### Carbohydrate

In [None]:
feature = 'carbohydrate'

x = df.groupby('high_traffic')[feature].mean().round(2).reset_index()
x

In [None]:
sns.barplot(data=df, x='high_traffic', y=feature);

In [None]:
sns.histplot(df[feature]);

### Sugar

In [None]:
feature = 'sugar'

x = df.groupby('high_traffic')[feature].mean().round(2).reset_index()
x

In [None]:
sns.barplot(data=df, x='high_traffic', y=feature);

In [None]:
sns.histplot(df[feature]);

### Protein

In [None]:
feature = 'protein'

x = df.groupby('high_traffic')[feature].mean().round(2).reset_index()
x

In [None]:
sns.barplot(data=df, x='high_traffic', y=feature);

In [None]:
sns.histplot(df[feature]);

### Category

In [None]:
feature = 'category'

df[feature].value_counts(normalize=True).round(2)

In [None]:
x = df.groupby(feature)['high_traffic'].mean().sort_values(ascending=False)
x


In [None]:
sns.barplot(data=df, x='high_traffic', y=feature, orient='h', order=x.index, palette='Blues_r');

### Servings

In [None]:
feature = 'servings'

sns.histplot(df[feature]);

In [None]:
df.groupby(feature)['high_traffic'].mean().round(2)

In [None]:
sns.barplot(data=df, x=feature, y='high_traffic', color='dodgerblue');

### Nutritional Facts Label

In [None]:
feature = 'nutritional_facts_label'

x = df[feature].value_counts(normalize=True)
x

In [None]:
df.groupby(feature)['high_traffic'].mean().round(2)

## Model Development
  - Include your reasons for selecting the models you use as well as a statement of the problem type
  - Code to fit the baseline and comparison models

- Goal: predict which recipes will lead to high traffic.
- Target: correctly predict high-traffic recipes 80% of the time.
- Problem type: binary classification, using logistic regression or a tree-based model.
- Key metric: precision, to minimize false positives.
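
Precision is the share of recipes predicted as high traffic that really were high traffic, so it maps directly onto the 80% goal. A minimal worked example follows; the counts are illustrative only and the `precision` helper is a hypothetical name, not part of this notebook:

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of predicted-positive recipes that really were high traffic."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        return 0.0  # no positive predictions were made
    return true_positives / predicted_positives


# Illustrative counts only, not results from this dataset.
tp, fp = 40, 10
target = 0.80  # the 80% goal stated above

score = precision(tp, fp)
print(f'precision = {score:.2f}, meets target: {score >= target}')
```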

### Split

In [None]:
target = 'high_traffic'

X = df.drop(columns=target)
y = df[target]

print('X shape:', X.shape)
print('y shape:', y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69)

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

### Baseline

In [None]:
majority_class = y_train.value_counts(normalize=True).idxmax()
accuracy_baseline = y_train.value_counts(normalize=True).max()

majority_classifier = DummyClassifier(strategy='most_frequent')
majority_classifier.fit(X_train, y_train)
y_pred_majority = majority_classifier.predict(X_test)
precision_baseline = precision_score(y_test, y_pred_majority)

print('Majority_class:', majority_class)
print('Baseline Accuracy:', round(accuracy_baseline, 4))
print('Baseline precision:', round(precision_baseline, 4))

### Iterate

In [None]:
numeric_features = ['calories', 'carbohydrate', 'sugar', 'protein']

numeric_transformer = make_pipeline( 
    SimpleImputer()
)

preprocessor = ColumnTransformer(
    [('num', numeric_transformer, numeric_features)]
)
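
Because `make_pipeline` names each step after its lowercased class name, the imputer inside the `num` transformer is addressed as `num__simpleimputer`. The sketch below rebuilds the same preprocessor inside a pipeline (the `'preprocessor'` and `'classifier'` step names are assumed here for illustration) and lists the parameter names that a grid search could tune for the imputer strategy:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline

numeric_features = ['calories', 'carbohydrate', 'sugar', 'protein']
preprocessor = ColumnTransformer(
    [('num', make_pipeline(SimpleImputer()), numeric_features)]
)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

# List the nested parameter names that end in 'strategy'.
strategy_params = [name for name in pipeline.get_params() if name.endswith('strategy')]
print(strategy_params)
```

Checking `get_params()` this way is a quick guard against typos in `param_grid` keys, which otherwise only surface as an error when the search is fitted.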

#### Logistic Regression

In [None]:
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

param_grid = {
    'preprocessor__num__simpleimputer__strategy': ['mean', 'median']
}

model_logreg = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
model_logreg.fit(X_train, y_train)

#### Decision Tree Classifier

In [None]:
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

param_grid = {
    'preprocessor__num__simpleimputer__strategy': ['mean', 'median'],
    'classifier__max_depth': range(10, 50, 10),
}

model_dt = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
model_dt.fit(X_train, y_train)

#### Random Forest

In [None]:
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

param_grid = {
    'preprocessor__num__simpleimputer__strategy': ['mean', 'median'],
    'classifier__max_depth': range(10, 50, 10),
    'classifier__n_estimators': range(25, 100, 25)
}

model_rf = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
model_rf.fit(X_train, y_train)
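
For classifiers, `GridSearchCV` ranks candidates by accuracy unless `scoring` is set. Since precision is named above as the key metric, one option is to tune on precision directly. The sketch below uses synthetic data from `make_classification` as a stand-in for the recipe data, and the `C` grid is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the recipe dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring='precision',  # rank candidates by cross-validated precision
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

With this setting, `best_score_` and `search.score(...)` both report precision rather than accuracy, which should be kept in mind when reading the evaluation output.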

## Model Evaluation
  - Describe the performance of the two models based on an appropriate metric

#### Model Comparison

In [None]:
models = [model_logreg, model_dt, model_rf]

for model in models:

    # Obtain model name
    model_name = type(model.best_estimator_['classifier']).__name__

    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Accuracy scores
    accuracy_train = model.score(X_train, y_train)
    accuracy_test = model.score(X_test, y_test)

    # Precision scores
    precision_train = precision_score(y_train, y_train_pred)
    precision_test = precision_score(y_test, y_test_pred)


    print(f'Model: {model_name}', '\n')
    print('Training Accuracy:', '\t', round(accuracy_train, 4))
    print('Test Accuracy:', '\t\t', round(accuracy_test, 4), '\n')
    print('Training Precision:', '\t', round(precision_train, 4))
    print('Test Precision:', '\t', round(precision_test, 4), '\n\n')  

In [None]:
confusion_matrix(y_train, model_logreg.predict(X_train))

In [None]:
print(classification_report(y_train, model_logreg.predict(X_train)))

In [None]:
ConfusionMatrixDisplay.from_estimator(model_logreg, X_train, y_train)

## Business Metrics
  - Define a way to compare your model performance to the business
  - Describe how your models perform using this approach

## Final Summary
- Final summary including recommendations that the business should undertake

## ✅ When you have finished...
-  Publish your Workspace using the option on the left
-  Check the published version of your report:
    -  Can you see everything you want us to grade?
    -  Are all the graphics visible?
-  Review the grading rubric. Have you included everything that will be graded?
-  Head back to the [Certification Dashboard](https://app.datacamp.com/certification) to submit your practical exam report and record your presentation