# Introduction to Data Science
## Homework 2

Student Name: Yuhan Liu

Student Netid: yl7576
***

### Part 1: Case study (5 Points)
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the data mining process, and be sure to include the motivation for predictive modeling and give a sketch of a solution.  Be precise but concise.

Answer:

In the case, Target wants to attract more customers, but normally people will have fixed shopping habits which are hard to change. Therefore, Target wants to capture some unique moments in consumers’ lives when their shopping habits become flexible. For example, expecting a baby, graduating from colleage, or moving to a new city, etc. 

Here, Target wants to know whether a consumer is pregnant or not. To build such a pregnancy-prediction model, the main resource they have is historical shopping data. Therefore, they can first use Target’s baby-shower registry to know the due date of a pregnant woman and observe how shopping habits changed as a woman approached her due date, which can be the train test data for the prediction model. For example, at the beginning of pregnancy, women customers may start to buy vitamins and when approaching to due date, they will buy baby products. In this way, when Target collects a woman’s shopping history, if she buys a vitamin specifically for pregnant women, they can predict that the customer may just get pregnant and start to send advertisement or coupon according to different pregnancy stage. 

Also, it may seem to be cursory for Target to determine pregnancy by only one-time purchase. They should collect more data such as one’s browsing history on the website. If one customer fits most characteristics of pregnancy, the model can give her a high score and start to market some related products. 

### Part 2: Exploring data in the command line (4 Points)
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use a bash shell (i.e., EC2 or a Mac terminal) and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]

[Here](https://opensource.com/article/17/2/command-line-tools-data-analysis-linux) is a good linux command line reference.

1\. How many records (lines) are in this file? (look up 'wc' command)

In [124]:
!wc -l data/advertising_events.csv

   10341 /Users/yuhanliu/Downloads/advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [125]:
!cut -d',' -f1 data/advertising_events.csv | sort -u | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [126]:
!cut -d "," -f3 data/advertising_events.csv | sort | uniq -c | sort -nr

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [127]:
!grep ^37, data/advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0


### Part 3: Dealing with data Pythonically (16 Points)

1\. (1 Point) Download the data set `"data/ads_dataset.tsv"` and load it into a Python Pandas data frame called `ads`.

In [128]:
# Place your code here
import pandas as pd
import numpy as np
ads = pd.read_csv("ads_dataset.tsv",sep = '\t')
ads

Unnamed: 0,isbuyer,buy_freq,visit_freq,buy_interval,sv_interval,expected_time_buy,expected_time_visit,last_buy,last_visit,multiple_buy,multiple_visit,uniq_urls,num_checkins,y_buy
,0,,1,0.0,0.000000,0.0,0.000000,106,106,0,0,169,2130,0
,0,,1,0.0,0.000000,0.0,0.000000,72,72,0,0,154,1100,0
,0,,1,0.0,0.000000,0.0,0.000000,5,5,0,0,4,12,0
,0,,1,0.0,0.000000,0.0,0.000000,6,6,0,0,150,539,0
,0,,2,0.0,0.500000,0.0,-101.149300,101,101,0,1,103,362,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,0,,3,0.0,30.979170,0.0,12.621240,8,8,0,1,168,2080,0
,0,,2,0.0,1.041667,0.0,-0.916713,1,1,0,1,1,15,0
,0,,1,0.0,0.000000,0.0,0.000000,20,20,0,0,132,556,0
,0,,1,0.0,0.000000,0.0,0.000000,180,180,0,0,71,400,0


2\. (4 Points) Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` method returns a useful series of values that can be used here.

In [129]:
def getDfSummary(input_data):
    output_data = input_data.describe().transpose()
    output_data["number_distinct"] = [ads[i].dropna().nunique() for i in output_data.index.values]
    output_data['number_nan'] = [ads[i].isna().sum() for i in output_data.index.values ]
    return output_data

3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `use %timeit`

In [130]:
%timeit getDfSummary(ads)

56.1 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


4\. (2 Points) Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [131]:
result = getDfSummary(ads)
result.loc[result['number_nan'] > 0].index.values
# buy_freq contains missing NaN values

array(['buy_freq'], dtype=object)

5\. (4 Points) For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or make it more likely that the data is missing? If missing, what should the data value be? Don't just show code here. Please explain your answer.[Edit this to ask for more details on why they are 0]

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?


Answer: the data is not missing at random. All NA data come from buy_freq column and we should set the data value to be 0 because the customer did not purchase anything. That is why features related to purchase process, such as isbuyer, expected_time_buy, and multiple_buy, are all zero in the summary table. 

In [132]:
null_ads = ads[ads.isnull().any(axis=1)] 
getDfSummary(null_ads)
# explanation is above 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,number_distinct,number_nan
isbuyer,52257.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,0
buy_freq,0.0,,,,,,,,10,52257
visit_freq,52257.0,1.651549,2.147955,1.0,1.0,1.0,2.0,84.0,64,0
buy_interval,52257.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,295,0
sv_interval,52257.0,5.686388,17.623555,0.0,0.0,0.0,0.041667,184.9167,5886,0
expected_time_buy,52257.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,348,0
expected_time_visit,52257.0,-9.669298,31.23903,-187.6156,0.0,0.0,0.0,91.40192,15135,0
last_buy,52257.0,65.741317,53.484622,0.0,19.0,52.0,106.0,188.0,189,0
last_visit,52257.0,65.741317,53.484622,0.0,19.0,52.0,106.0,188.0,189,0
multiple_buy,52257.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,0


6\. (4 Points) Which variables are binary?

In [133]:
result.loc[result['number_distinct'] == 2].index.values
# Variables: isbuyer, multiple_buy, multiple_visit, y_buy are binary because they only have two distinct values

array(['isbuyer', 'multiple_buy', 'multiple_visit', 'y_buy'], dtype=object)