# **BACKGROUND**

When users are shopping online, not all will end up purchasing something. Most visitors to an online shopping website, in fact, likely don’t end up going through with a purchase during that web browsing session. It might be useful, though, for a shopping website to be able to predict whether a user intends to make a purchase or not: perhaps displaying different content to the user, like showing the user a discount offer if the website believes the user isn’t planning to complete the purchase. How could a website determine a user’s purchasing intent? That’s where machine learning will come in.

Your task in this problem is to build a nearest-neighbor classifier that predicts purchasing intent. Given information about a user (how many pages they’ve visited, whether they’re shopping on a weekend, what web browser they’re using, etc.), your classifier will predict whether or not the user will make a purchase. Your classifier won’t be perfectly accurate, since perfectly modeling human behavior is a task well beyond the scope of this class, but it should be better than guessing randomly. To train your classifier, we’ll provide you with data from about 12,000 user sessions on a shopping website.

How do we measure the accuracy of a system like this? If we have a testing data set, we could run our classifier on the data and compute the proportion of the time we correctly classify the user’s intent. This would give us a single accuracy percentage. But that number might be a little misleading. Imagine, for example, that about 15% of all users end up going through with a purchase. A classifier that always predicted that the user would not go through with a purchase would then measure as 85% accurate: the only users it classifies incorrectly are the 15% of users who do go through with a purchase. And while 85% accuracy sounds pretty good, that doesn’t seem like a very useful classifier.


Instead, we’ll measure two values: sensitivity (also known as the “true positive rate”) and specificity (also known as the “true negative rate”). Sensitivity refers to the proportion of positive examples that were correctly identified: in other words, the proportion of users who did go through with a purchase who were correctly identified. Specificity refers to the proportion of negative examples that were correctly identified: in this case, the proportion of users who did not go through with a purchase who were correctly identified. So our “always guess no” classifier from before would have perfect specificity (1.0) but no sensitivity (0.0). Our goal is to build a classifier that performs reasonably on both metrics.
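
To make these numbers concrete, here is a minimal sketch that scores the “always guess no” classifier on a made-up test set of 100 sessions, 15 of which end in a purchase. The data and the variable names are assumptions chosen for illustration; they are not part of the distribution code.

```python
# Hypothetical test set: 100 sessions, 15 of which end in a purchase (label 1)
labels = [1] * 15 + [0] * 85

# A classifier that always predicts "no purchase" (label 0)
predictions = [0] * len(labels)

# Accuracy: proportion of all sessions classified correctly
correct = sum(1 for actual, predicted in zip(labels, predictions) if actual == predicted)
accuracy = correct / len(labels)

# Sensitivity: proportion of actual purchases that were predicted as purchases
true_positives = sum(1 for a, p in zip(labels, predictions) if a == 1 and p == 1)
sensitivity = true_positives / labels.count(1)

# Specificity: proportion of actual non-purchases that were predicted as non-purchases
true_negatives = sum(1 for a, p in zip(labels, predictions) if a == 0 and p == 0)
specificity = true_negatives / labels.count(0)

print(accuracy, sensitivity, specificity)  # 0.85 0.0 1.0
```

The accuracy looks respectable, while the sensitivity of 0.0 shows that not a single purchasing user was identified.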



# ***Intro***

Download the distribution code from https://cdn.cs50.net/ai/2020/x/projects/4/shopping.zip and unzip it.

Run pip3 install scikit-learn to install the scikit-learn package if it isn’t already installed, which you’ll need for this project.

In [26]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
from pandas_profiling import ProfileReport

In [28]:
import pandas as pd
df=pd.read_csv('/content/drive/MyDrive/shopping.csv')
df

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.000000,0.000000,0.100000,0.000000,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.050000,0.140000,0.000000,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.500000,0.020000,0.050000,0.000000,0.0,Feb,3,3,1,4,Returning_Visitor,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12325,3,145.0,0,0.0,53,1783.791667,0.007143,0.029031,12.241717,0.0,Dec,4,6,1,1,Returning_Visitor,True,False
12326,0,0.0,0,0.0,5,465.750000,0.000000,0.021333,0.000000,0.0,Nov,3,2,1,8,Returning_Visitor,True,False
12327,0,0.0,0,0.0,6,184.250000,0.083333,0.086667,0.000000,0.0,Nov,3,2,1,13,Returning_Visitor,True,False
12328,4,75.0,0,0.0,15,346.000000,0.000000,0.021053,0.000000,0.0,Nov,2,2,3,11,Returning_Visitor,False,False


In [29]:
df.shape  # check the DataFrame dimensions

(12330, 18)

In [30]:
df.describe(), df.describe  # check descriptive statistics; look for NaN values and outliers

(       Administrative  Administrative_Duration  Informational  \
 count    12330.000000             12330.000000   12330.000000   
 mean         2.315166                80.818611       0.503569   
 std          3.321784               176.779107       1.270156   
 min          0.000000                 0.000000       0.000000   
 25%          0.000000                 0.000000       0.000000   
 50%          1.000000                 7.500000       0.000000   
 75%          4.000000                93.256250       0.000000   
 max         27.000000              3398.750000      24.000000   
 
        Informational_Duration  ProductRelated  ProductRelated_Duration  \
 count            12330.000000    12330.000000             12330.000000   
 mean                34.472398       31.731468              1194.746220   
 std                140.749294       44.475503              1913.669288   
 min                  0.000000        0.000000                 0.000000   
 25%                  0.00000

First, open up shopping.csv, the data set provided to you for this project. You can open it in a text editor, but you may find it easier to understand visually in a spreadsheet application like Microsoft Excel, Apple Numbers, or Google Sheets.

# **Legend**

The first six columns measure the different types of pages users have visited in the session: the Administrative, Informational, and ProductRelated columns measure how many of those types of pages the user visited, and their corresponding _Duration columns measure how much time the user spent on any of those pages. The BounceRates, ExitRates, and PageValues columns measure information from Google Analytics about the page the user visited. SpecialDay is a value that measures how close the date of the user’s session is to a special day (like Valentine’s Day or Mother’s Day). Month is an abbreviation of the month the user visited. OperatingSystems, Browser, Region, and TrafficType are all integers describing information about the user themself. VisitorType will take on the value Returning_Visitor for returning visitors and some other string value for non-returning visitors. Weekend is TRUE or FALSE depending on whether or not the user is visiting on a weekend.


Perhaps the most important column, though, is the last one: the Revenue column. This is the column that indicates whether the user ultimately made a purchase or not: TRUE if they did, FALSE if they didn’t. This is the column that we’d like to learn to predict (the “label”), based on the values for all of the other columns (the “evidence”).





---



# ***Specification***
An automated tool assists the staff in enforcing the constraints in the below specification. Your submission will fail if any of these are not handled properly, if you import modules other than those explicitly allowed, or if you modify functions other than as permitted.



---





Complete the implementation of load_data, train_model, and evaluate in shopping.py.

The load_data function should accept a CSV filename as its argument, open that file, and return a tuple (evidence, labels). evidence should be a list of all of the evidence for each of the data points, and labels should be a list of all of the labels for each data point.

Since you’ll have one piece of evidence and one label for each row of the spreadsheet, the length of the evidence list and the length of the labels list should ultimately be equal to the number of rows in the CSV spreadsheet (excluding the header row). The lists should be ordered according to the order the users appear in the spreadsheet. That is to say, evidence[0] should be the evidence for the first user, and labels[0] should be the label for the first user.
Each element in the evidence list should itself be a list. The list should be of length 17: the number of columns in the spreadsheet excluding the final column (the label column).
The values in each evidence list should be in the same order as the columns that appear in the evidence spreadsheet. You may assume that the order of columns in shopping.csv will always be presented in that order.
Note that, to build a nearest-neighbor classifier, all of our data needs to be numeric. Be sure that your values have the following types:

In [31]:
df.dtypes  # check the data types

Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
VisitorType                 object
Weekend                       bool
Revenue                       bool
dtype: object



> Administrative, Informational, ProductRelated, Month, OperatingSystems, Browser, Region, TrafficType, VisitorType, and Weekend should all be of type int.

> Administrative_Duration, Informational_Duration, ProductRelated_Duration, BounceRates, ExitRates, PageValues, and SpecialDay should all be of type float.

> Month should be 0 for January, 1 for February, 2 for March, etc. up to 11 for December.

> VisitorType should be 1 for returning visitors and 0 for non-returning visitors.

> Weekend should be 1 if the user visited on a weekend and 0 otherwise.



In [None]:
# Change data types to match the specification
from sklearn.model_selection import train_test_split

train_csv = df.copy()
for column in ['Administrative', 'Informational', 'ProductRelated', 'OperatingSystems']:
    train_csv[column] = train_csv[column].astype('int')
train_csv['SpecialDay'] = train_csv['SpecialDay'].astype('float')

# Encode Month as a number: 0 for January, 1 for February, ..., 11 for December.
# Only the first three letters are compared, so 'Jun' and 'June' both map to 5.
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
train_csv['Month'] = train_csv['Month'].str[:3].map({name: i for i, name in enumerate(month_names)})

# Encode VisitorType, Weekend, and Revenue as 0 or 1
# (VisitorType: 1 for returning visitors, 0 for non-returning visitors)
train_csv['VisitorType'] = (train_csv['VisitorType'] == 'Returning_Visitor').astype('int')
train_csv['Weekend'] = train_csv['Weekend'].astype('int')
train_csv['Revenue'] = train_csv['Revenue'].astype('int')

# Declare variables: the evidence (features) and the label (target)
features = train_csv.drop(columns=['Revenue'])
target = train_csv['Revenue']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=2)
print(X_train.shape, X_test.shape)

Each value of labels should either be the integer 1, if the user did go through with a purchase, or 0 otherwise.
For example, the value of the first evidence list should be [0, 0.0, 0, 0.0, 1, 0.0, 0.2, 0.2, 0.0, 0.0, 1, 1, 1, 1, 1, 1, 0] and the value of the first label should be 0.
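
The conversion rules above can be sketched with the standard library alone. This is one possible approach under stated assumptions, not the reference solution: the helper names `parse_row` and `load_rows` are hypothetical, the sample reads from an in-memory string instead of shopping.csv, and months are matched on their first three letters in case the file spells a month out (for example "June").

```python
import csv
import io

# Month abbreviations in calendar order; index 0 is January
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

INT_COLUMNS = ["Administrative", "Informational", "ProductRelated",
               "OperatingSystems", "Browser", "Region", "TrafficType"]
FLOAT_COLUMNS = ["Administrative_Duration", "Informational_Duration",
                 "ProductRelated_Duration", "BounceRates", "ExitRates",
                 "PageValues", "SpecialDay"]


def parse_row(row):
    """Convert one CSV row (a dict of strings) into (evidence, label)."""
    evidence = []
    for column, value in row.items():
        if column == "Revenue":
            continue  # the label is handled separately below
        if column in INT_COLUMNS:
            evidence.append(int(value))
        elif column in FLOAT_COLUMNS:
            evidence.append(float(value))
        elif column == "Month":
            evidence.append(MONTHS.index(value[:3]))
        elif column == "VisitorType":
            evidence.append(1 if value == "Returning_Visitor" else 0)
        elif column == "Weekend":
            evidence.append(1 if value.upper() == "TRUE" else 0)
    label = 1 if row["Revenue"].upper() == "TRUE" else 0
    return evidence, label


def load_rows(csv_file):
    """Read an open CSV file object and return (evidence, labels)."""
    evidence, labels = [], []
    for row in csv.DictReader(csv_file):
        row_evidence, label = parse_row(row)
        evidence.append(row_evidence)
        labels.append(label)
    return evidence, labels


# Header plus the first data row of the data set, held in memory for the demo
sample = io.StringIO(
    "Administrative,Administrative_Duration,Informational,"
    "Informational_Duration,ProductRelated,ProductRelated_Duration,"
    "BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,"
    "Browser,Region,TrafficType,VisitorType,Weekend,Revenue\n"
    "0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,FALSE,FALSE\n"
)
evidence, labels = load_rows(sample)
print(evidence[0], labels[0])
```

A `load_data(filename)` function could open the file and pass the file object to a helper like this one; the sample row reproduces the first evidence list and label given in the example above.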

In [None]:
"""# Drop an unneeded label
train_csv.drop(columns='Weekend', inplace=True)"""

In [None]:
train_csv.columns

In [None]:
"""# Drop the columns that do not need normalization
train_csv.drop(columns=['Month', 'Weekend', 'VisitorType', 'SpecialDay'], inplace=True)"""

In [None]:
"""# save the preprocessed data
train_csv.to_csv('/content/drive/MyDrive/shopping_preprocessed.csv')"""

In [36]:
# Import Pandas Profiling
from pandas_profiling import ProfileReport


In [None]:
profile = ProfileReport(train_csv, minimal=True)
profile

In [None]:
import pandas as pd

# Separate the evidence columns from the label column
df_data = train_csv.drop(columns=['Revenue'])
df_labels = train_csv[['Revenue']]

In [None]:
df_labels.head()

In [41]:
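# Normalize the data if required. Is normalization actually needed?
"""def min_max_normalize(lst):
    # Compute the bounds once instead of on every iteration
    lowest, highest = min(lst), max(lst)
    normalized = []

    for value in lst:
        normalized_num = (value - lowest) / (highest - lowest)
        normalized.append(normalized_num)

    return normalized"""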

In [None]:
"""for column in df_data.columns:
    df_data[column] = min_max_normalize(df_data[column])

df_data.describe()"""
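
On the question of whether normalization is needed: a nearest-neighbor classifier compares sessions by distance (Euclidean by default in scikit-learn), so a column measured in large units can outweigh a column measured in small ones. The sketch below is only an illustration, using a handful of values modeled on the ProductRelated_Duration and BounceRates columns; the helper handles a constant column as an added assumption, and nothing here claims that scaling improves the results on this data set.

```python
import math


def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column has no spread; map it to zeros to avoid dividing by zero
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Illustrative sessions: ProductRelated_Duration (seconds) and BounceRates
durations = [0.0, 64.0, 627.5, 1783.8]
bounce_rates = [0.2, 0.0, 0.02, 0.007]

raw = list(zip(durations, bounce_rates))
scaled = list(zip(min_max_normalize(durations), min_max_normalize(bounce_rates)))

# Distance between the first two sessions, before and after scaling
raw_distance = math.dist(raw[0], raw[1])
scaled_distance = math.dist(scaled[0], scaled[1])
print(raw_distance, scaled_distance)
```

Before scaling, the distance is almost entirely the 64-second difference in duration; after scaling, the difference in bounce rate contributes as well.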

In [4]:
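# Variables: counts of true/false positives and negatives
def evaluate(labels, predictions):
    tp = 0
    fn = 0
    fp = 0
    tn = 0
    for actual, prediction in zip(labels, predictions):
        if actual == prediction:
            if actual == 1:
                tp += 1
            else:
                tn += 1
        else:
            if actual == 1:
                # An actual purchase predicted as no purchase is a false negative
                fn += 1
            else:
                # An actual non-purchase predicted as a purchase is a false positive
                fp += 1
    # sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
    return tp / (tp + fn), tn / (tn + fp)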

In [None]:
from sklearn.neighbors import KNeighborsClassifier  # all of the data is numeric, so try KNeighborsClassifier (k=1)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Split into training and validation sets
# (assumes train_csv has already been converted to numeric types)
target = 'Revenue'
training_data, validation_data, training_labels, validation_labels = train_test_split(
    train_csv.drop(columns=[target]), train_csv[target],
    test_size=0.2, stratify=train_csv[target], random_state=100)

print(len(training_data))  # sizes of the train/validation split
print(len(validation_data))
print(len(training_labels))
print(len(validation_labels))

# Check the target ratio: class 0 dominates, so it serves as the baseline model
print(training_labels.value_counts(normalize=True))

# Separate the features
features = training_data.columns

# Build the model (k values from 1 to 7 were tried by hand)
classifier = KNeighborsClassifier(n_neighbors=3)

training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 101)
for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(training_data, training_labels)
    training_accuracy.append(clf.score(training_data, training_labels))
    test_accuracy.append(clf.score(validation_data, validation_labels))

classifier.fit(training_data, training_labels)  # fit the KNeighborsClassifier model on the training data


In [None]:
"""print(classifier.score(validation_data, validation_labels))  # evaluate accuracy on the validation set"""

ML Model Evaluation

In [None]:
# Visualize the results
plt.plot(neighbors_settings,training_accuracy,label='train_accuracy')
plt.plot(neighbors_settings,test_accuracy,label='test_accuracy')
plt.ylabel('accuracy')
plt.xlabel('n_neighbors')
plt.legend()
plt.show()



In [None]:
# Visualize the results
import matplotlib.pyplot as plt
k_list = range(1,101)
accuracies = []
for k in k_list:
  classifier = KNeighborsClassifier(n_neighbors = k)
  classifier.fit(training_data, training_labels)
  accuracies.append(classifier.score(validation_data, validation_labels))

plt.plot(k_list, accuracies)
plt.xlabel("k")
plt.ylabel("Validation Accuracy")
plt.title("Shopping Classifier Accuracy")
plt.show()

Observation: accuracy decreased as k grew, and k = 1 performed best.

In [None]:
# Compute (sensitivity, specificity), i.e. the true positive rate and the true negative rate, with the evaluate function.


def evaluate(labels, predictions):
    """
    Given a list of actual labels and a list of predicted labels,
    return a tuple (sensitivity, specificity).
    Assume each label is either a 1 (positive) or 0 (negative).
    `sensitivity` should be a floating-point value from 0 to 1
    representing the "true positive rate": the proportion of
    actual positive labels that were accurately identified.
    `specificity` should be a floating-point value from 0 to 1
    representing the "true negative rate": the proportion of
    actual negative labels that were accurately identified.
    """
    sensitivity = float(0)
    specificity = float(0)

    total_positive = float(0)
    total_negative = float(0)

    for label, prediction in zip(labels, predictions):

        if label == 1:
            total_positive += 1
            if label == prediction:
                sensitivity += 1

        if label == 0:
            total_negative += 1
            if label == prediction:
                specificity += 1

    sensitivity /= total_positive
    specificity /= total_negative

    return sensitivity, specificity

"""print(classification_report(y_val, y_pred))"""



---



The train_model function should accept a list of evidence and a list of labels, and return a scikit-learn nearest-neighbor classifier (a k-nearest-neighbor classifier where k = 1) fitted on that training data.

Notice that we’ve already imported for you from sklearn.neighbors import KNeighborsClassifier. You’ll want to use a KNeighborsClassifier in this function.
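
As a rough sketch of what that could look like, assuming scikit-learn is installed: the function below fits a `KNeighborsClassifier` with `n_neighbors=1` and returns it. The toy evidence and labels are made up for the demonstration and have nothing to do with shopping.csv; treat this as one possible shape for the function, not the reference solution.

```python
from sklearn.neighbors import KNeighborsClassifier


def train_model(evidence, labels):
    """Return a k-nearest-neighbor classifier (k = 1) fitted on the data."""
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(evidence, labels)
    return model


# Tiny made-up data set: two features per point, labels 0 or 1
toy_evidence = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
toy_labels = [0, 0, 1, 1]

model = train_model(toy_evidence, toy_labels)

# Each new point receives the label of its single closest training point
predictions = model.predict([[0.05, 0.1], [4.8, 5.1]])
print(list(predictions))
```

With k = 1, every prediction copies the label of the closest training example, which is why the first query point is labeled 0 and the second 1.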



---



The evaluate function should accept a list of labels (the true labels for the users in the testing set) and a list of predictions (the labels predicted by your classifier), and return two floating-point values (sensitivity, specificity).

sensitivity should be a floating-point value from 0 to 1 representing the “true positive rate”: the proportion of actual positive labels that were accurately identified.
specificity should be a floating-point value from 0 to 1 representing the “true negative rate”: the proportion of actual negative labels that were accurately identified.
You may assume each label will be 1 for positive results (users who did go through with a purchase) or 0 for negative results (users who did not go through with a purchase).
You may assume that the list of true labels will contain at least one positive label and at least one negative label.

You should not modify anything else in shopping.py other than the functions the specification calls for you to implement, though you may write additional functions and/or import other Python standard library modules. You may also import numpy or pandas or anything from scikit-learn, if familiar with them, but you should not use any other third-party Python modules. You should not modify shopping.csv.

In [None]:
# Evaluate the classification results to judge whether KNeighborsClassifier is a suitable ML model

In [None]:
# Predictions for users who actually purchased and users who did not are judged by whether both metrics, sensitivity and specificity, are reasonable

In [None]:
# Evaluation result: True / False

# ***How to Submit***
You may not have your code in your ai50/projects/2020/x/shopping branch nested within any further subdirectories (such as a subdirectory called shopping or project4a). That is to say, if the staff attempts to access https://github.com/me50/USERNAME/blob/ai50/projects/2020/x/shopping/shopping.py, where USERNAME is your GitHub username, that is exactly where your file should live. If your file is not at that location when the staff attempts to grade, your submission will fail.

Visit this link, log in with your GitHub account, and click Authorize cs50. Then, check the box indicating that you’d like to grant course staff access to your submissions, and click Join course.
Install Git and, optionally, install submit50.
If you’ve installed submit50, execute
submit50 ai50/projects/2020/x/shopping
Otherwise, using Git, push your work to https://github.com/me50/USERNAME.git, where USERNAME is your GitHub username, on a branch called ai50/projects/2020/x/shopping.

Submit this form.
You can then go to https://cs50.me/cs50ai to view your current progress!

# ***How to Get Help***

Ask questions via Ed!

Ask questions via any of CS50’s communities!

# ***✈ SPECIAL THANKS TO MoNa for RECOMMENDATION OF HARVARD LECTURE CS50 AI WITH PYTHON***


---


# **Acknowledgements**

Data set provided by Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic (2018)