# Introduction to Data Science
## Homework 2

Student Name: Ramya Dhatri Vunikili

Student Netid: rdv253
***

### Part 1: Case study
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

I Goal:
Predict whether or not a woman is pregnant

II Predictive Model Attributes:
Below are a few attributes I considered in deciding which model to use for this scenario:
1) Classification
Since there are two possible answers to our question, I would use a classification model and predict P(Y|X), where
Y = Pregnant/Not Pregnant        and          X = Items bought from Target
2) Discriminative
Target already has abundant data, so it can estimate P(Y|X) directly instead of first estimating P(X|Y) and then using it to derive P(Y|X).
3) Entropy
I'm looking to maximize the entropy without assuming that the features are conditionally independent. In other words, the probability that a woman is pregnant should be higher when she buys both vitamins and large amounts of sanitary supplies than when she buys just the vitamins. Also, since we have testable information (the training set collected at Target), we can choose the probability distribution that maximizes the information entropy subject to it.
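
One way to write down the discriminative model described above is sketched here. The symbols are assumptions introduced for illustration: $x$ is a binary feature vector with one entry per product category bought, and $w$, $b$ are the weights and intercept to be learned.

```latex
P(Y = 1 \mid X = x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}},
\qquad
\log \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = w^\top x + b
```

Under this form each purchased item adds its own weight to the log-odds, so buying both vitamins and sanitary supplies gives a higher probability than buying vitamins alone whenever both weights are positive. This is also the form the maximum-entropy distribution takes for a binary target when it is constrained to match feature expectations in the training data.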

III Choosing The Appropriate Model:
The candidate classifiers for predicting whether or not a woman is pregnant in this scenario are logistic regression, decision trees and support vector machines (SVMs). We can model the problem with logistic regression, as it satisfies all the above attributes. Also, multicollinearity among the features can be dealt with by L2 regularization in logistic regression.
Decision trees, on the other hand, are built by a greedy algorithm that makes the locally best decision at each step rather than looking at the bigger picture. For example, maternity wear could be a good feature to start the decision tree with. But a similarly concrete decision about pregnancy cannot be made if a woman buys a good body lotion; it's not unusual for a woman to buy large amounts of lotion even when she is not pregnant. As the tree grows, this might result in not selecting the best possible tree as a whole, and could also result in a highly biased model.
With logistic regression as our baseline model, we can try to improve the performance further with an SVM. With a non-linear kernel, SVMs can overcome the disadvantage of logistic regression when there are non-linear interactions between the features.
Hence, I would build my baseline model on logistic regression and then add SVMs to lift the performance above the vanilla baseline.
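
The baseline described above can be sketched in plain Python. Everything in this block is an assumption made for illustration: the toy purchase rows, the labels, the function names and the hyperparameters (`lam`, `lr`, `epochs`) are hypothetical and are not Target's data or method. It fits an L2-penalized logistic regression by batch gradient descent.

```python
import math

# Hypothetical toy data (not real Target data). Each row is
# (bought_vitamins, bought_lotion, bought_maternity_wear); label 1 = pregnant.
X = [
    (1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1),
    (1, 0, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0),
]
y = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """P(Y=1 | x) under the logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logistic_l2(X, y, lam=0.1, lr=0.5, epochs=2000):
    """Batch gradient descent on the L2-penalized mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # The L2 penalty shrinks the weights (not the intercept) toward zero
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


w, b = fit_logistic_l2(X, y)
p_vitamins_only = predict_proba(w, b, (1, 0, 0))
p_vitamins_and_lotion = predict_proba(w, b, (1, 1, 0))
print(round(p_vitamins_only, 3), round(p_vitamins_and_lotion, 3))
```

On this toy data the fitted probability for a basket with vitamins and lotion comes out higher than for vitamins alone, which is the behavior described under "Entropy" above. In practice a library implementation would replace the hand-written loop.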


### Part 2: Exploring data in the command line
For this part we will be using the data file located in `"advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use ssh to use the actual bash shell in your terminal and then just paste your answers here. Recall that once you enter the "!" then filename completion should work. Also, these are standard data exploration commands that are quick and easy to use in a terminal or in the notebook. We don't cover command line operations formally in this class, but these are worth learning (and thus are part of the HW). Be resourceful. Use whatever online cheat sheets or Stackoverflow to answer the question.]

1\. How many records (lines) are in this file (look up wc)?

In [1]:
# Place your code here
!wc -l < advertising_events.csv

10341


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [2]:
# Place your code here
!cut -d, -f1 advertising_events.csv | sort -u | wc -l

732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [3]:
# Place your code here
!cut -d, -f3 advertising_events.csv | sort | uniq -c | sort -rn

   3114 google.com
   2092 facebook.com
   1036 youtube.com
   1034 yahoo.com
   1022 baidu.com
    513 wikipedia.org
    511 amazon.com
    382 qq.com
    321 twitter.com
    316 taobao.com
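
The counts from questions 1 to 3 can be cross-checked in Python with the standard library. This is a minimal sketch, assuming the four-column layout described above; the six sample rows are hypothetical stand-ins so the block runs without the real file (swap `sample` for `open("advertising_events.csv")` to use it).

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the layout userid,timestamp,domain,action
sample = io.StringIO(
    "37,648061658,google.com,0\n"
    "12,642479972,google.com,2\n"
    "37,644493341,facebook.com,2\n"
    "45,654941318,yahoo.com,1\n"
    "12,649979874,google.com,1\n"
    "45,653061949,facebook.com,1\n"
)

rows = list(csv.reader(sample))
unique_users = {row[0] for row in rows}          # like: cut -f1 | sort -u
domain_counts = Counter(row[2] for row in rows)  # like: cut -f3 | sort | uniq -c

print(len(rows))          # number of records
print(len(unique_users))  # number of distinct user ids
# most_common() orders by count, descending, like a reverse numeric sort
for domain, count in domain_counts.most_common():
    print(count, domain)
```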


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [4]:
# Place your code here
!grep '^37,' advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0


### Part 3: Dealing with data Pythonically

In [5]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1\. Load the data set `"ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [6]:
# Place your code here
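# DataFrame.from_csv is deprecated; read_csv with index_col=0 and
# parse_dates=True is the documented equivalent
ads = pd.read_csv("ads_dataset.tsv", sep='\t', index_col=0, parse_dates=True)
ads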

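| | isbuyer | buy_freq | visit_freq | buy_interval | sv_interval | expected_time_buy | expected_time_visit | last_buy | last_visit | multiple_buy | multiple_visit | uniq_urls | num_checkins | y_buy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NaT | 0 | NaN | 1 | 0.0 | 0.000000 | 0.0 | 0.000000 | 106 | 106 | 0 | 0 | 169 | 2130 | 0 |
| NaT | 0 | NaN | 1 | 0.0 | 0.000000 | 0.0 | 0.000000 | 72 | 72 | 0 | 0 | 154 | 1100 | 0 |
| NaT | 0 | NaN | 1 | 0.0 | 0.000000 | 0.0 | 0.000000 | 5 | 5 | 0 | 0 | 4 | 12 | 0 |
| NaT | 0 | NaN | 1 | 0.0 | 0.000000 | 0.0 | 0.000000 | 6 | 6 | 0 | 0 | 150 | 539 | 0 |
| NaT | 0 | NaN | 2 | 0.0 | 0.500000 | 0.0 | -101.149300 | 101 | 101 | 0 | 1 | 103 | 362 | 0 |
| NaT | 0 | NaN | 1 | 0.0 | 0.000000 | 0.0 | 0.000000 | 42 | 42 | 0 | 0 | 17 | 35 | 0 |
| NaT | 0 | NaN | 1 | 0.0 | 0.000000 | 0.0 | 0.000000 | 42 | 42 | 0 | 0 | 42 | 110 | 0 |
| NaT | 0 | NaN | 2 | 0.0 | 29.791670 | 0.0 | -106.188300 | 121 | 121 | 0 | 1 | 101 | 401 | 0 |
| NaT | 0 | NaN | 3 | 0.0 | 45.479170 | 0.0 | -34.144730 | 64 | 64 | 0 | 1 | 100 | 298 | 0 |
| NaT | 0 | NaN | 1 | 0.0 | 0.000000 | 0.0 | 0.000000 | 13 | 13 | 0 | 0 | 53 | 247 | 0 |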


2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [10]:
def getDfSummary(input_data):
    # Count missing values and distinct non-missing values for each variable
    output_data = pd.DataFrame({
        'number_nan': input_data.isnull().sum(),
        'number_distinct': input_data.nunique(),
    }, columns=['number_nan', 'number_distinct'])
    # describe() ignores NaN; transpose so that each row is a variable
    temp = input_data.describe().transpose()
    output_data = output_data.join(temp[['mean', 'max', 'min', 'std', '25%', '50%', '75%']])
    return output_data
getDfSummary(ads)

| | number_nan | number_distinct | mean | max | min | std | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0 | 2 | 0.042632 | 1.0 | 0.0 | 0.202027 | 0.0 | 0.0 | 0.0 |
| buy_freq | 52257 | 10 | 1.240653 | 15.0 | 1.0 | 0.782228 | 1.0 | 1.0 | 1.0 |
| visit_freq | 0 | 64 | 1.852777 | 84.0 | 0.0 | 2.92182 | 1.0 | 1.0 | 2.0 |
| buy_interval | 0 | 295 | 0.210008 | 174.625 | 0.0 | 3.922016 | 0.0 | 0.0 | 0.0 |
| sv_interval | 0 | 5886 | 5.82561 | 184.9167 | 0.0 | 17.595442 | 0.0 | 0.0 | 0.104167 |
| expected_time_buy | 0 | 348 | -0.19804 | 84.28571 | -181.9238 | 4.997792 | 0.0 | 0.0 | 0.0 |
| expected_time_visit | 0 | 15135 | -10.210786 | 91.40192 | -187.6156 | 31.879722 | 0.0 | 0.0 | 0.0 |
| last_buy | 0 | 189 | 64.729335 | 188.0 | 0.0 | 53.476658 | 18.0 | 51.0 | 105.0 |
| last_visit | 0 | 189 | 64.729335 | 188.0 | 0.0 | 53.476658 | 18.0 | 51.0 | 105.0 |
| multiple_buy | 0 | 2 | 0.006357 | 1.0 | 0.0 | 0.079479 | 0.0 | 0.0 | 0.0 |
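
To check that a summary function of this kind computes what the task asks for, it helps to run it on a frame small enough to verify by hand. The sketch below is self-contained and restates a compact version of the function. The frame `toy` and its values are hypothetical and are not taken from the ads data.

```python
import numpy as np
import pandas as pd


def getDfSummary(input_data):
    """Per-variable NaN count, distinct count and describe() statistics."""
    summary = pd.DataFrame({
        'number_nan': input_data.isnull().sum(),
        'number_distinct': input_data.nunique(),  # NaN is not counted
    })
    stats = input_data.describe().transpose()     # NaN is ignored
    return summary.join(stats[['mean', 'max', 'min', 'std', '25%', '50%', '75%']])


# Hypothetical toy frame: two non-buyers with no recorded purchase frequency
toy = pd.DataFrame({
    'isbuyer': [0, 1, 0, 1],
    'buy_freq': [np.nan, 2.0, np.nan, 4.0],
})

toy_summary = getDfSummary(toy)
print(toy_summary)
```

For `buy_freq` the two missing entries are counted in `number_nan`, while the mean is taken over the two observed values only.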


3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [8]:
# Place your code here
%timeit getDfSummary(ads)

10 loops, best of 3: 91.8 ms per loop
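
`%timeit` is an IPython magic, so it is only available inside the notebook. Outside it, a comparable measurement can be sketched with the standard `timeit` module. The function `summarize` below is a hypothetical stand-in for `getDfSummary(ads)`, used so the block runs without the data set.

```python
import timeit


def summarize(values):
    # Hypothetical stand-in for getDfSummary(ads)
    ordered = sorted(values)
    return min(ordered), max(ordered), sum(ordered) / len(ordered)


data = list(range(10000))

# Mirror "10 loops, best of 3": three repeats of ten calls, keep the fastest
timings = timeit.repeat(lambda: summarize(data), repeat=3, number=10)
best_per_loop = min(timings) / 10
print("10 loops, best of 3: %.3f ms per loop" % (best_per_loop * 1e3))
```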


4\. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [9]:
# Place your code here
ads_summary = getDfSummary(ads)
print(ads_summary.index[ads_summary['number_nan'] > 0].tolist())

['buy_freq']


5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [11]:
# Place your code here
ads_missing = ads.loc[ads['buy_freq'].isnull()]
getDfSummary(ads_missing)

| | number_nan | number_distinct | mean | max | min | std | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| buy_freq | 52257 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| visit_freq | 0 | 48 | 1.651549 | 84.0 | 1.0 | 2.147955 | 1.0 | 1.0 | 2.0 |
| buy_interval | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sv_interval | 0 | 5112 | 5.686388 | 184.9167 | 0.0 | 17.623555 | 0.0 | 0.0 | 0.041667 |
| expected_time_buy | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| expected_time_visit | 0 | 13351 | -9.669298 | 91.40192 | -187.6156 | 31.23903 | 0.0 | 0.0 | 0.0 |
| last_buy | 0 | 189 | 65.741317 | 188.0 | 0.0 | 53.484622 | 19.0 | 52.0 | 106.0 |
| last_visit | 0 | 189 | 65.741317 | 188.0 | 0.0 | 53.484622 | 19.0 | 52.0 | 106.0 |
| multiple_buy | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


The variables 'isbuyer', 'buy_interval', 'expected_time_buy' and 'multiple_buy' are all zero when 'buy_freq' is NaN, so the data does not look like it is missing at random: 'isbuyer' = 0 predicts a missing 'buy_freq' in every one of these records. The likely reason is that if a person is classified as not a buyer, we have no record of prior purchases and hence cannot deduce any information about them. This explains why the buying frequency is NaN and also why no prediction is made about the expected buying time. Since the smallest observed 'buy_freq' is 1, the missing values can reasonably be read as 0 purchases.
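
The relationship described above can be checked directly, and the gaps filled, with a few lines of pandas. This is a sketch on a hypothetical toy frame with the same pattern as the ads data. The fill value of 0 rests on the assumption that a non-buyer has made zero purchases, which the data set itself does not state.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the pattern seen in the ads data
toy = pd.DataFrame({
    'isbuyer':  [0, 1, 0, 1, 0],
    'buy_freq': [np.nan, 1.0, np.nan, 3.0, np.nan],
})

# True when buy_freq is missing for exactly the non-buyers and nobody else
missing_iff_nonbuyer = bool(
    (toy['buy_freq'].isnull() == (toy['isbuyer'] == 0)).all()
)

# Assumption: a missing buying frequency means zero purchases
filled = toy.assign(buy_freq=toy['buy_freq'].fillna(0))

print(missing_iff_nonbuyer)
print(filled)
```

Running the same comparison on `ads` would confirm or refute the perfect correspondence for the full data set before any value is imputed.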

6\. Which variables are binary?

In [12]:
# Place your code here
ads_stats = getDfSummary(ads)
ads_stats.index[(ads_stats.number_distinct == 2) & (ads_stats['min'] == 0) & (ads_stats['max'] == 1)].tolist()

['isbuyer', 'multiple_buy', 'multiple_visit', 'y_buy']