# Introduction to Data Science
## Homework 2

Student Name: Yurui Mu

Student Netid: ym1495
***

### Part 1: Case study
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

Target's business motivation is to predict habit-changing moments in its consumers' lives, especially for customers who are expecting a baby. Data preparation should include a user ID for each consumer and that consumer's purchase records. Target should first analyze historical data to find items correlated with a newborn, that is, classify or cluster commodities by how strongly they are associated with pregnancy. Then, if a customer's recent purchases overlap with several of these categories, we can estimate the probability that the customer is pregnant.
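The data preparation step described above can be sketched in pandas. This is only an illustrative sketch: the column names (`user_id`, `category`), the item categories, and the `is_expecting` label are all made up here, and the assumption that a label is available for some customers (for example from a registry) is mine, not something this write-up establishes.

```python
import pandas as pd

# Hypothetical purchase records: one row per purchased item.
purchases = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "category": ["unscented_lotion", "vitamins", "beer", "chips", "vitamins"],
})

# Hypothetical target variable, indexed by user_id (1 = expecting, 0 = not).
labels = pd.Series({1: 1, 2: 0, 3: 0}, name="is_expecting")

# One row per customer, one column per category, values are purchase counts.
features = pd.crosstab(purchases["user_id"], purchases["category"])

# Training table: feature columns plus the label, aligned on user_id.
training = features.join(labels)
print(training)
```

A classifier that outputs class probabilities could then be fit on a table like `training` to score customers whose label is unknown.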

### Part 2: Exploring data in the command line
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use ssh to use the actual bash shell on EC2 (see original directions) and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]

1\. How many records (lines) are in this file?

In [4]:
!wc -l /Users/muriel820/Documents/data\ science\ hw/1001/data/advertising_events.csv

   10341 /Users/muriel820/Documents/data science hw/1001/data/advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [6]:
!cut -d ',' -f 1 /Users/muriel820/Documents/data\ science\ hw/1001/data/advertising_events.csv | sort | uniq | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [8]:
!cut -d ',' -f 3 /Users/muriel820/Documents/data\ science\ hw/1001/data/advertising_events.csv | sort | uniq -c | sort -nr

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [14]:
!grep "^37,[^,]*,[^,]*,[^,]*$" /Users/muriel820/Documents/data\ science\ hw/1001/data/advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
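As a cross-check on the four shell pipelines in this part, the same questions can be answered with the Python standard library. The sketch below runs on a small in-memory sample rather than the real file; the first three rows are copied from the user 37 output above, while the rows for users 137 and 5 are invented to show why the user id must be matched exactly.

```python
import csv
import io
from collections import Counter

# Small made-up sample in the same 4-column layout: userid,timestamp,domain,action
sample = """37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
137,654941318,facebook.com,1
5,649979874,google.com,1
"""

rows = list(csv.reader(io.StringIO(sample)))

num_records = len(rows)                                   # like `wc -l`
unique_users = len({row[0] for row in rows})              # like `cut | sort | uniq | wc -l`
domain_rank = Counter(row[2] for row in rows).most_common()  # like `uniq -c | sort -nr`
user_37 = [row for row in rows if row[0] == "37"]         # like `grep "^37,"`

print(num_records, unique_users)
print(domain_rank)
print(user_37)
```

Comparing the whole first field to `"37"` plays the same role as the `^37,` anchor in the `grep` pattern: it keeps user 137 and any timestamp containing 37 out of the result.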


### Part 3: Dealing with data Pythonically

In [17]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1\. Load the data set `"data/ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [25]:
ads = pd.read_table('/Users/muriel820/Documents/data science hw/1001/data/ads_dataset.tsv')

2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [26]:
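def getDfSummary(input_data):
    # describe() gives count, mean, std, min, percentiles, and max, ignoring NaN
    output_data = input_data.describe().transpose()
    # Number of missing values per variable
    number_nan = input_data.isnull().sum()
    # Number of distinct non-missing values per variable (nunique() drops NaN by default)
    number_distinct = input_data.nunique()
    output_data['Number_nan'] = number_nan
    output_data['Number_distinct'] = number_distinct

    return output_data

print(getDfSummary(ads))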

                       count        mean          std       min    25%    50%  \
isbuyer              54584.0    0.042632     0.202027    0.0000    0.0    0.0   
buy_freq              2327.0    1.240653     0.782228    1.0000    NaN    NaN   
visit_freq           54584.0    1.852777     2.921820    0.0000    1.0    1.0   
buy_interval         54584.0    0.210008     3.922016    0.0000    0.0    0.0   
sv_interval          54584.0    5.825610    17.595442    0.0000    0.0    0.0   
expected_time_buy    54584.0   -0.198040     4.997792 -181.9238    0.0    0.0   
expected_time_visit  54584.0  -10.210786    31.879722 -187.6156    0.0    0.0   
last_buy             54584.0   64.729335    53.476658    0.0000   18.0   51.0   
last_visit           54584.0   64.729335    53.476658    0.0000   18.0   51.0   
multiple_buy         54584.0    0.006357     0.079479    0.0000    0.0    0.0   
multiple_visit       54584.0    0.277444     0.447742    0.0000    0.0    0.0   
uniq_urls            54584.0



3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [27]:
%timeit getDfSummary(ads)



10 loops, best of 3: 89.3 ms per loop


4\. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [28]:
DfSummary = getDfSummary(ads)
DfSummary.columns[pd.isnull(DfSummary).any()].tolist()



['25%', '50%', '75%']
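Note that the expression above lists the columns of the summary frame that contain NaN (the percentile statistics), not the fields of `ads`. Since each field is a row of the summary, one possible alternative is to select the row labels whose `Number_nan` count is positive. The sketch below assumes a summary built the same way as `getDfSummary()` and uses a made-up toy frame, not the real `ads` data.

```python
import numpy as np
import pandas as pd

def summarize(df):
    # Same construction as getDfSummary(): describe() plus two extra columns.
    summary = df.describe().transpose()
    summary["Number_nan"] = df.isnull().sum()
    summary["Number_distinct"] = df.nunique()
    return summary

# Toy data (hypothetical values) where only buy_freq has missing entries.
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0],
    "visit_freq": [1, 3, 0, 2],
})

summary = summarize(toy)

# Fields (rows of the summary) that contain at least one missing value.
fields_with_nan = summary.index[summary["Number_nan"] > 0].tolist()
print(fields_with_nan)
```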

5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

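    The data is not missing at random. buy_freq is NaN when isbuyer is 0: only 2327 of the 54584 records have a buy_freq value, which matches the number of buyers implied by the isbuyer mean (0.042632 × 54584 ≈ 2327). A non-buyer has made no purchases, so the missing buy_freq values should be 0.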

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?
    
    After filling NaN with 0, buy_freq's mean drops from 1.240653 to 0.052891 and its standard deviation from 0.782228 to 0.298157.
   

In [29]:
new_ads=ads.fillna(0)
getDfSummary(new_ads)

| | count | mean | std | min | 25% | 50% | 75% | max | Number_nan | Number_distinct |
|---|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 54584.0 | 0.042632 | 0.202027 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |
| buy_freq | 54584.0 | 0.052891 | 0.298157 | 0.0 | 0.0 | 0.0 | 0.0 | 15.0 | 0 | 11 |
| visit_freq | 54584.0 | 1.852777 | 2.92182 | 0.0 | 1.0 | 1.0 | 2.0 | 84.0 | 0 | 64 |
| buy_interval | 54584.0 | 0.210008 | 3.922016 | 0.0 | 0.0 | 0.0 | 0.0 | 174.625 | 0 | 295 |
| sv_interval | 54584.0 | 5.82561 | 17.595442 | 0.0 | 0.0 | 0.0 | 0.104167 | 184.9167 | 0 | 5886 |
| expected_time_buy | 54584.0 | -0.19804 | 4.997792 | -181.9238 | 0.0 | 0.0 | 0.0 | 84.28571 | 0 | 348 |
| expected_time_visit | 54584.0 | -10.210786 | 31.879722 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 | 0 | 15135 |
| last_buy | 54584.0 | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| last_visit | 54584.0 | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| multiple_buy | 54584.0 | 0.006357 | 0.079479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |
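The hint for this question suggests looking at only the records with a missing value. A minimal sketch of that comparison follows, using a made-up toy frame in place of `ads`; the split on `buy_freq` assumes that is the field with missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ads (hypothetical values).
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1, 0],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0, np.nan],
    "visit_freq": [1, 3, 0, 2, 1],
})

# Split the records by whether buy_freq is missing.
missing = toy[toy["buy_freq"].isnull()]
present = toy[toy["buy_freq"].notnull()]

# Side-by-side means: a field whose mean differs sharply between the two
# groups is a candidate predictor of missingness.
comparison = pd.DataFrame({"missing": missing.mean(), "present": present.mean()})
print(comparison)

# In this toy data every record with a missing buy_freq is a non-buyer.
print((missing["isbuyer"] == 0).all())
```

On the real data, the same split could be passed through `getDfSummary()` to compare the full set of summary statistics rather than only the means.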


6\. Which variables are binary?

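        Binary describes the values a variable takes, not its dtype, so int64 and float64 columns can still be binary. In the summary above, isbuyer and multiple_buy each have a Number_distinct of 2 with min 0 and max 1, so they are binary. Any other field with exactly two distinct values would be binary as well.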

In [32]:
new_ads.dtypes

isbuyer                  int64
buy_freq               float64
visit_freq               int64
buy_interval           float64
sv_interval            float64
expected_time_buy      float64
expected_time_visit    float64
last_buy                 int64
last_visit               int64
multiple_buy             int64
multiple_visit           int64
uniq_urls                int64
num_checkins             int64
y_buy                    int64
dtype: object
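Because `dtypes` reports storage types rather than the set of values, a value-based check is one way to find binary variables. The sketch below uses a made-up toy frame and a hypothetical helper `is_binary`; it treats a column as binary when its non-missing values are only 0 and 1 and both occur.

```python
import pandas as pd

# Toy stand-in for new_ads (hypothetical values).
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "visit_freq": [1, 3, 0, 2],
    "multiple_buy": [0, 0, 1, 0],
    "buy_interval": [0.0, 1.5, 0.0, 0.0],
})

def is_binary(column):
    # Hypothetical helper: exactly two distinct values, both drawn from {0, 1}.
    values = column.dropna()
    return bool(values.isin([0, 1]).all()) and values.nunique() == 2

binary_columns = [name for name in toy.columns if is_binary(toy[name])]
print(binary_columns)
```

Note that `buy_interval` has two distinct values here but is not flagged, since one of them is 1.5; dropping the `isin` condition would give the looser "two distinct values" definition instead.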