# Introduction to Data Science
## Homework 2

Student Name: Sanjay Subramanian

Student Netid: ss14383
***

### Part 1: Case study (5 Points)
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the data mining process, and be sure to include the motivation for predictive modeling and give a sketch of a solution.  Be precise but concise.

When dealing with a business problem like the one Target faced in the early 2000s, we can use the data mining process to structure our approach so that it is reasonably consistent, repeatable, and objective.

First and foremost, business understanding allows for a formal problem definition. In this case, Target's question would be "Can we predict whether a female customer is pregnant based on the items she has recently purchased?" Because the shopping habits of expecting parents are unusually flexible, information like this is worth millions to retail stores hoping to win customers for an extended period of time, which is the motivation for predictive modeling in this scenario.

Data understanding and preparation comprise the next step, as the available data may not always be reliable enough to answer the question, especially historical data that may not be fully relevant. In Target's case, the baby shower registry provided useful information about the due-date-sensitive shopping habits of pregnant women, which allowed Pole to assign shoppers a "pregnancy prediction score."

Given the availability of the registry data and the binary output of the question at hand, the modeling step can be framed as a supervised classification problem. Specifically, as in the churn example, we are dealing with a class probability estimation problem, with the target variable being whether a female customer is pregnant (labeled using the registry) and the features being the items she (or perhaps her spouse) buys.

A practical approach is a logistic regression model, whose output is the log-odds (and hence the probability) of class membership, in this case the probability that a female customer is pregnant. As the sample size in a given registry data set is probably on the small end, logistic regression provides a more robust framework than, say, a decision tree.
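As a rough illustration of class probability estimation with logistic regression, the sketch below fits a tiny model by gradient descent in plain Python. Everything in it is an assumption made for illustration: the helper names `fit_logistic` and `predict_proba`, the two purchase-indicator features, and the toy labels are hypothetical and are not Target's data or method. In practice a library implementation would normally be used instead.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Estimated probability of the positive class for one feature vector."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Hypothetical toy data: [bought unscented lotion, bought supplements]
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0], [1, 0], [0, 0]]
y = [1, 1, 0, 0, 1, 0, 0, 0]  # 1 = on the registry (assumed label)

w = fit_logistic(X, y)
score_both = predict_proba(w, [1, 1])
score_none = predict_proba(w, [0, 0])
print(f"both items: {score_both:.2f}, neither: {score_none:.2f}")
```

The scores are probabilities, so customers can be ranked by them rather than only assigned a hard yes/no label.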

To gain confidence in any model, evaluation is essential, in part to guard against phenomena such as selection bias and concept drift. It is generally a good idea to test with separate training and validation datasets in order to quantitatively assess model performance. Once this stage is complete, deployment is necessary to achieve an ROI. In Target's case, this step could be combined with targeted advertising practices in order to maximize customer acquisition.
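One way to make "quantitatively assess model performance" concrete is to score a held-out validation set and compute a ranking metric. The sketch below uses AUC, which is my choice of metric and is not named in the answer above. The function name `auc` and the validation labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation labels and model scores
val_labels = [0, 0, 1, 1]
val_scores = [0.1, 0.4, 0.35, 0.8]
val_auc = auc(val_labels, val_scores)
print(val_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.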

### Part 2: Exploring data in the command line (4 Points)
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use a bash shell (i.e., EC2 or a Mac terminal) and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]

[Here](https://opensource.com/article/17/2/command-line-tools-data-analysis-linux) is a good linux command line reference.

1\. How many records (lines) are in this file? (look up 'wc' command)

In [80]:
!wc -l advertising_events.csv

#10341 lines

2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [None]:
!cut -d',' -f 1 advertising_events.csv | sort | uniq | wc -l

#732 unique users

3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [None]:
!cut -d',' -f 3 advertising_events.csv | sort | uniq -c | sort -rn

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
513 wikipedia.org
511 amazon.com
382 qq.com
321 twitter.com
316 taobao.com
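The same ranking can be cross-checked in Python. This is a minimal sketch on a made-up in-memory sample, not the real file; the sample rows are hypothetical, and `collections.Counter` stands in for the `sort | uniq -c` step.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same column order: userid, timestamp, domain, action
sample = io.StringIO(
    "1,100,google.com,0\n"
    "2,101,facebook.com,1\n"
    "3,102,google.com,2\n"
    "1,103,yahoo.com,0\n"
    "2,104,google.com,1\n"
    "3,105,facebook.com,0\n"
)

# Count visits per domain (third column), then list them in descending order
domain_counts = Counter(row[2] for row in csv.reader(sample))
ranking = domain_counts.most_common()
for domain, count in ranking:
    print(f"{count:>5} {domain}")
```

Because the counts are compared as integers here, the ordering does not depend on how the numbers happen to be padded when printed.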

4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [None]:
!grep '^37,' advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
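Filtering on the first field only avoids accidental matches where `37` appears in another column. A minimal Python sketch of that field-aware filter follows. The sample rows are hypothetical and were chosen to include near-misses.

```python
import csv
import io

# Hypothetical sample: userid, timestamp, domain, action
sample = io.StringIO(
    "37,648061658,google.com,0\n"
    "137,642479972,google.com,2\n"   # different user whose id contains "37"
    "4,37,facebook.com,1\n"          # "37" appears in the timestamp column
    "37,640878012,amazon.com,0\n"
)

# Keep only rows whose first field is exactly "37"
user_rows = [row for row in csv.reader(sample) if row[0] == "37"]
for row in user_rows:
    print(",".join(row))
```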

### Part 3: Dealing with data Pythonically (16 Points)

1\. (1 Point) Download the data set `"data/ads_dataset.tsv"` and load it into a Python Pandas data frame called `ads`.

In [81]:
# Place your code here
import numpy as np
import pandas as pd

ads = pd.read_csv('C:/Users/sanjs/Documents/DS-GA 1001/ads_dataset.tsv', sep = '\t')

2\. (4 Points) Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` method returns a useful series of values that can be used here.

In [53]:
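def getDfSummary(df):
    """Return one row per numeric column of df with its summary statistics."""
    # describe() ignores NaN and yields mean, std, min, quartiles, and max
    stats = df.describe().T.drop(columns='count')
    # Count missing values, and distinct non-missing values, per column
    stats['number_nan'] = df.isnull().sum()
    stats['number_distinct'] = df.nunique()
    return stats

ads_stats = getDfSummary(ads)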

| | mean | std | min | 25% | 50% | 75% | max | number_nan | number_distinct |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0.042632 | 0.202027 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |
| buy_freq | 1.240653 | 0.782228 | 1.0 | 1.0 | 1.0 | 1.0 | 15.0 | 52257 | 10 |
| visit_freq | 1.852777 | 2.92182 | 0.0 | 1.0 | 1.0 | 2.0 | 84.0 | 0 | 64 |
| buy_interval | 0.210008 | 3.922016 | 0.0 | 0.0 | 0.0 | 0.0 | 174.625 | 0 | 295 |
| sv_interval | 5.82561 | 17.595442 | 0.0 | 0.0 | 0.0 | 0.104167 | 184.9167 | 0 | 5886 |
| expected_time_buy | -0.19804 | 4.997792 | -181.9238 | 0.0 | 0.0 | 0.0 | 84.28571 | 0 | 348 |
| expected_time_visit | -10.210786 | 31.879722 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 | 0 | 15135 |
| last_buy | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| last_visit | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| multiple_buy | 0.006357 | 0.079479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |
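To see how the summary treats missing values, it can help to run the same steps on a tiny frame where the answers are obvious. The sketch below is a toy check under assumptions: `summarize` is a hypothetical stand-in that mirrors the steps of `getDfSummary()`, and the `toy` data is made up.

```python
import numpy as np
import pandas as pd

def summarize(df):
    """Per-column describe() statistics plus NaN and distinct-value counts."""
    stats = df.describe().T.drop(columns="count")
    stats["number_nan"] = df.isnull().sum()
    stats["number_distinct"] = df.nunique()  # nunique() skips NaN by default
    return stats

# Hypothetical toy data: buy_freq is missing for two rows
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0],
})
toy_stats = summarize(toy)
print(toy_stats[["mean", "number_nan", "number_distinct"]])
```

Note that the mean of `buy_freq` is computed over the two non-missing values only, which is why a column's summary can look very different once its NaNs are filled in.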


3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: use `%timeit`

In [54]:
%timeit getDfSummary(ads)

58.7 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
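Outside a notebook, where the `%timeit` magic is unavailable, a similar measurement can be made with the standard-library `timeit` module. This is a minimal sketch; `work` is a hypothetical stand-in for the call being measured.

```python
import timeit

def work():
    # Stand-in workload; replace with the call being measured
    return sum(i * i for i in range(1000))

# Each entry is the total time in seconds for 100 calls; repeat the measurement 3 times
runs = timeit.repeat(work, number=100, repeat=3)

# The minimum is usually the least noisy estimate of the per-call cost
per_call_ms = min(runs) / 100 * 1000
print(f"{per_call_ms:.4f} ms per call (best of {len(runs)} runs)")
```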


4\. (2 Points) Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [85]:
ads_stats.number_nan
#The buy_freq field contains 52257 NaN values

isbuyer                    0
buy_freq               52257
visit_freq                 0
buy_interval               0
sv_interval                0
expected_time_buy          0
expected_time_visit        0
last_buy                   0
last_visit                 0
multiple_buy               0
multiple_visit             0
uniq_urls                  0
num_checkins               0
y_buy                      0
Name: number_nan, dtype: int64

5\. (4 Points) For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or make it more likely that the data is missing? If missing, what should the data value be? Don't just show code here. Please explain your answer.

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [103]:
ads_missing = ads[ads['buy_freq'].isnull()]
ads_missing_stats = getDfSummary(ads_missing)
ads_missing_stats - ads_stats

#The values in buy_freq are NaN when isbuyer is 0, so the data is not missing at random: the values of buy_freq rely on isbuyer. Since a non-buyer has made no purchases, these values could be converted to an integer identifier such as 0.
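A direct way to test whether missingness lines up with another field is to compare the NaN mask against that field. The sketch below does this on made-up data. The `toy` frame is hypothetical, and filling with 0 is only one candidate value, resting on the assumption that a missing `buy_freq` means no purchases.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the pattern described above
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1, 0],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0, np.nan],
})

missing = toy["buy_freq"].isnull()

# Share of missing buy_freq values within each isbuyer group
nan_rate = missing.groupby(toy["isbuyer"]).mean()

# True only if buy_freq is missing for every non-buyer and present for every buyer
aligned = bool((missing == (toy["isbuyer"] == 0)).all())

# One candidate fill value, assuming NaN means "no purchases"
filled = toy["buy_freq"].fillna(0)
print(nan_rate, aligned, filled.tolist(), sep="\n")
```

A NaN rate of 1.0 in one group and 0.0 in the other indicates that the missingness is fully determined by the grouping field rather than random.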

6\. (4 Points) Which variables are binary?

In [45]:
(ads_stats.number_distinct == 2).sum()
#4 binary variables

4
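Since the question asks which variables are binary, the names can be listed rather than only counted. The sketch below shows the idea on a hypothetical `toy` frame. Note the assumption that "two distinct values" is being used as the definition of binary, which does not by itself guarantee the values are 0 and 1.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "visit_freq": [1, 2, 5, 3],
    "multiple_buy": [0, 0, 1, 0],
})

# Columns with exactly two distinct non-missing values
distinct = toy.nunique()
binary_vars = distinct[distinct == 2].index.tolist()
print(binary_vars)
```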