# Part V - Learning
## Project 5a - Shopping

[Course Link](https://cs50.harvard.edu/ai/)

[Project Instructions](https://cs50.harvard.edu/ai/projects/4/shopping/)

## Instructions
When users are shopping online, not all will end up purchasing something. Most visitors to an online shopping website, in fact, likely don’t end up going through with a purchase during that web browsing session. It might be useful, though, for a shopping website to be able to predict whether a user intends to make a purchase or not: perhaps displaying different content to the user, like showing the user a discount offer if the website believes the user isn’t planning to complete the purchase. How could a website determine a user’s purchasing intent? That’s where machine learning will come in.

Your task in this problem is to build a nearest-neighbor classifier. Given information about a user (how many pages they’ve visited, whether they’re shopping on a weekend, what web browser they’re using, etc.), your classifier will predict whether or not the user will make a purchase. Your classifier won’t be perfectly accurate, since perfectly modeling human behavior is a task well beyond the scope of this class, but it should be better than guessing randomly. To train your classifier, we’ll provide you with some data from a shopping website from about 12,000 user sessions.
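
As a rough illustration of the idea, a nearest-neighbor classifier with k = 1 gives a new point the label of the closest training point. The sketch below is pure Python with Euclidean distance; the two features and all the numbers are made up for illustration and are not taken from the shopping data, and `nearest_neighbor_predict` is a hypothetical helper, not part of the project code.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query` (k = 1)."""
    distances = [math.dist(point, query) for point in train_points]
    closest = distances.index(min(distances))
    return train_labels[closest]


# Hypothetical two-feature sessions: (pages visited, page value)
train_points = [(2, 0.0), (3, 0.0), (40, 25.0), (55, 30.0)]
train_labels = [0, 0, 1, 1]  # 1 = purchase, 0 = no purchase

print(nearest_neighbor_predict(train_points, train_labels, (52, 28.0)))  # 1
print(nearest_neighbor_predict(train_points, train_labels, (1, 0.0)))    # 0
```

The project itself uses scikit-learn's `KNeighborsClassifier` rather than a hand-written loop like this one.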

How do we measure the accuracy of a system like this? If we have a testing data set, we could run our classifier on the data, and compute what proportion of the time we correctly classify the user’s intent. This would give us a single accuracy percentage. But that number might be a little misleading. Imagine, for example, if about 15% of all users end up going through with a purchase. A classifier that always predicted that the user would not go through with a purchase, then, we would measure as being 85% accurate: the only users it classifies incorrectly are the 15% of users who do go through with a purchase. And while 85% accuracy sounds pretty good, that doesn’t seem like a very useful classifier.
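
The arithmetic behind that 85% figure can be checked with a few lines of Python. This is a minimal sketch using a hypothetical test set of 100 sessions, 15 of which end in a purchase, matching the "imagine 15%" scenario above rather than the real data.

```python
# Hypothetical test set: 100 sessions, 15 of which end in a purchase (label 1)
labels = [1] * 15 + [0] * 85

# A "classifier" that always predicts no purchase (label 0)
predictions = [0] * len(labels)

correct = sum(actual == predicted for actual, predicted in zip(labels, predictions))
accuracy = correct / len(labels)

print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 85%
```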

Instead, we’ll measure two values: sensitivity (also known as the “true positive rate”) and specificity (also known as the “true negative rate”). Sensitivity refers to the proportion of positive examples that were correctly identified: in other words, the proportion of users who did go through with a purchase who were correctly identified. Specificity refers to the proportion of negative examples that were correctly identified: in this case, the proportion of users who did not go through with a purchase who were correctly identified. So our “always guess no” classifier from before would have perfect specificity (1.0) but no sensitivity (0.0). Our goal is to build a classifier that performs reasonably on both metrics.
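
To make the two definitions concrete, here is a small sketch that computes both values for the "always guess no" classifier. The function name `sensitivity_specificity` and the 15/85 split are assumptions chosen for illustration; it also assumes the labels contain at least one positive and one negative example.

```python
def sensitivity_specificity(labels, predictions):
    """Return (true positive rate, true negative rate) for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    positives = sum(1 for actual, _ in pairs if actual == 1)
    negatives = len(pairs) - positives
    true_positives = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
    true_negatives = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
    return true_positives / positives, true_negatives / negatives


# Hypothetical test set: 15 purchases, 85 non-purchases
labels = [1] * 15 + [0] * 85
always_no = [0] * len(labels)

print(sensitivity_specificity(labels, always_no))  # (0.0, 1.0)
```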

## Data Formatting and Preparation
I used Pandas to prepare the evidence and label data. The main program is below this section, but I kept these notebook cells to show the detailed steps taken to convert the data properly.

In [104]:
df = pd.read_csv('shopping.csv')

In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

In [106]:
df.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


In [145]:
c_df = df.copy()

In [146]:
# Convert All Columns to Proper Format For ML Algorithm

c_df[['Revenue']] = c_df[['Revenue']].astype(int)
c_df['Weekend'] = c_df['Weekend'].astype(int)
c_df = c_df.replace({'VisitorType' : { 'Returning_Visitor' : '1', 
                                       'New_Visitor' : '0', 
                                       'Other' : '0' }})
c_df['VisitorType'] = c_df['VisitorType'].astype(int)

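# Jan and Apr are included for completeness so the mapping covers all 12 months
c_df = c_df.replace({'Month': { 'Jan':'0', 'Feb':'1', 'Mar':'2', 'Apr':'3', 
                               'May':'4', 'June':'5', 'Jul':'6', 'Aug':'7', 
                               'Sep':'8', 'Oct':'9', 'Nov':'10', 'Dec':'11'}})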
c_df['Month'] = c_df['Month'].astype(int)
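
As an aside, the month-to-index mapping does not have to be typed out by hand. Below is a sketch in plain Python that derives the 0-based indices from an ordered list of abbreviations; `MONTH_NAMES` and `MONTH_INDEX` are names I made up for this example, and the extra `'June'` key is there because the mapping above uses the full spelling for that month.

```python
# Ordered month abbreviations; the position in the list is the 0-based index
MONTH_NAMES = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
MONTH_INDEX = {name: index for index, name in enumerate(MONTH_NAMES)}

# June is spelled out in full in the mapping used for this data set
MONTH_INDEX['June'] = MONTH_INDEX['Jun']

print(MONTH_INDEX['Feb'], MONTH_INDEX['June'], MONTH_INDEX['Dec'])  # 1 5 11
```

A dictionary like this could then be passed to something like `c_df['Month'].map(MONTH_INDEX)`, which would yield integers directly, though I have not swapped it into the notebook cells here.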

In [147]:
c_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  int64  
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

In [148]:
# Passing the axis positionally (drop('Revenue', 1)) no longer works in pandas 2.x
edf = c_df.drop(columns='Revenue')

In [149]:
# Convert df to a list of rows, preserving each column's data type

evidence = list(list(x) for x in zip(*(edf[x].values.tolist() for x in edf.columns)))

In [151]:
print(len(evidence))
print(evidence[0])

12330
[0, 0.0, 0, 0.0, 1, 0.0, 0.2, 0.2, 0.0, 0.0, 1, 1, 1, 1, 1, 1, 0]
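
The `zip(*...)` expression above is a transpose: it takes one list per column and regroups the values into one list per row. Here is a sketch of the same idea with plain Python lists standing in for the DataFrame columns. The three columns and their values are made up for illustration. As far as I know, converting a DataFrame with mixed integer and float columns through a single NumPy array would upcast the integers to floats, which is why the per-column approach is used.

```python
# Hypothetical stand-in for three DataFrame columns
columns = {
    'Administrative': [0, 0, 3],                    # integers
    'Administrative_Duration': [0.0, 64.0, 12.5],   # floats
    'Weekend': [0, 0, 1],                           # integers
}

# Transpose: one list per column becomes one list per row
rows = [list(row) for row in zip(*columns.values())]

print(rows[0])                                # [0, 0.0, 0]
print([type(value).__name__ for value in rows[0]])  # ['int', 'float', 'int']
```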


In [132]:
labels = c_df['Revenue'].tolist()

In [152]:
print(len(labels))
print(labels[0])

12330
0


## Main Program

In [204]:
import csv
import sys
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

TEST_SIZE = 0.4

def main():

    # Load data from spreadsheet and split into train and test sets
    evidence, labels = load_data('shopping.csv')
    X_train, X_test, y_train, y_test = train_test_split(
        evidence, labels, test_size=TEST_SIZE
    )

    # Train model and make predictions
    model = train_model(X_train, y_train)
    predictions = model.predict(X_test)
    sensitivity, specificity = evaluate(y_test, predictions)

    # Print results
    print(f"Correct: {(y_test == predictions).sum()}")
    print(f"Incorrect: {(y_test != predictions).sum()}")
    print(f"True Positive Rate: {100 * sensitivity:.2f}%")
    print(f"True Negative Rate: {100 * specificity:.2f}%")


def load_data(filename):
    """
    Load shopping data from a CSV file `filename` and convert into a list of
    evidence lists and a list of labels. Return a tuple (evidence, labels).

    evidence should be a list of lists, where each list contains the
    following values, in order:
        - Administrative, an integer
        - Administrative_Duration, a floating point number
        - Informational, an integer
        - Informational_Duration, a floating point number
        - ProductRelated, an integer
        - ProductRelated_Duration, a floating point number
        - BounceRates, a floating point number
        - ExitRates, a floating point number
        - PageValues, a floating point number
        - SpecialDay, a floating point number
        - Month, an index from 0 (January) to 11 (December)
        - OperatingSystems, an integer
        - Browser, an integer
        - Region, an integer
        - TrafficType, an integer
        - VisitorType, an integer 0 (not returning) or 1 (returning)
        - Weekend, an integer 0 (if false) or 1 (if true)

    labels should be the corresponding list of labels, where each label
    is 1 if Revenue is true, and 0 otherwise.
    """
    df = pd.read_csv(filename)
    c_df = df.copy()
    
    # Convert columns to proper formats per specs above
    c_df[['Revenue']] = c_df[['Revenue']].astype(int)
    c_df['Weekend'] = c_df['Weekend'].astype(int)
    c_df = c_df.replace({'VisitorType' : { 'Returning_Visitor' : '1', 
                                           'New_Visitor' : '0', 
                                           'Other' : '0' }})
    c_df['VisitorType'] = c_df['VisitorType'].astype(int)

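    # Jan and Apr are included for completeness so the mapping covers all 12 months
    c_df = c_df.replace({'Month': { 'Jan':'0', 'Feb':'1', 'Mar':'2', 'Apr':'3', 
                                   'May':'4', 'June':'5', 'Jul':'6', 'Aug':'7', 
                                   'Sep':'8', 'Oct':'9', 'Nov':'10', 'Dec':'11'}})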
    c_df['Month'] = c_df['Month'].astype(int)
    
    # Remove the 'Revenue' column for evidence list
    edf = c_df.drop(columns='Revenue')
    
    # Create evidence list by converting edf dataframe and preserving datatypes
    evidence = list(list(x) for x in zip(*(edf[x].values.tolist() for x in edf.columns)))
    
    # Create labels list
    labels = c_df['Revenue'].tolist()
    
    #print(evidence[0])
    #print(labels[0])
    
    return (evidence, labels)

def train_model(evidence, labels):
    """
    Given a list of evidence lists and a list of labels, return a
    fitted k-nearest neighbor model (k=1) trained on the data.
    """
    model = KNeighborsClassifier(n_neighbors=1)
    
    fitted_model = model.fit(evidence, labels)
    
    #print(fitted_model)
    return fitted_model


def evaluate(labels, predictions):
    """
    Given a list of actual labels and a list of predicted labels,
    return a tuple (sensitivity, specificity).

    Assume each label is either a 1 (positive) or 0 (negative).

    `sensitivity` should be a floating-point value from 0 to 1
    representing the "true positive rate": the proportion of
    actual positive labels that were accurately identified.

    `specificity` should be a floating-point value from 0 to 1
    representing the "true negative rate": the proportion of
    actual negative labels that were accurately identified.
    """
    
    # labels are from y-test and represent the actual label values
    #print(f'Number of Actual Labels: {len(labels)}')
    #print(f'Number of Predicted Labels: {len(predictions)}')
    
    total_real_0 = 0
    total_real_1 = 0
    total_pred_0_correct = 0
    total_pred_0_incorrect = 0
    total_pred_1_correct = 0 
    total_pred_1_incorrect = 0
        
    for i, real_val in enumerate(labels):
        if real_val == 1:
            total_real_1 += 1
            if real_val == predictions[i]:
                total_pred_1_correct += 1
            else:
                total_pred_1_incorrect += 1
        elif real_val == 0:
            total_real_0 += 1
            if real_val == predictions[i]:
                total_pred_0_correct += 1
            else:
                total_pred_0_incorrect += 1
    
    #print(f'Total Real Values: {total_real_0 + total_real_1}')
    #print(f'Total Real 1: {total_real_1}')
    #print(f'Total Real 0: {total_real_0}')

    #print(f'Total Predicted 1 Correct: {total_pred_1_correct}')
    #print(f'Total Predicted 1 InCorrect: {total_pred_1_incorrect}')
    #print(f'Total Predicted 0 Correct: {total_pred_0_correct}')
    #print(f'Total Predicted 0 InCorrect: {total_pred_0_incorrect}')
       
    sens = total_pred_1_correct / total_real_1
    spec = total_pred_0_correct / total_real_0
    return sens, spec


In [205]:
main()

Correct: 4107
Incorrect: 825
True Positive Rate: 41.72%
True Negative Rate: 90.78%
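
As a quick sanity check on the numbers printed above, the sketch below recomputes the test-set size and the overall accuracy from the correct and incorrect counts. The only inputs are values already shown in this document (4107, 825, 12330 rows, `TEST_SIZE = 0.4`). Because `train_test_split` is called without a fixed `random_state`, a rerun will most likely print slightly different counts.

```python
correct, incorrect = 4107, 825
total_rows = 12330  # rows reported by df.info()

test_rows = correct + incorrect
accuracy = correct / test_rows

print(test_rows)               # 4932
print(test_rows / total_rows)  # 0.4, matching TEST_SIZE
print(f"{accuracy:.2%}")       # 83.27%
```

An overall accuracy of about 83% looks similar to the hypothetical "always guess no" classifier from the instructions, which is why the true positive and true negative rates are the more informative figures here.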
