# DS-SF-30 | Unit Project 1: Research Design

In this first unit project you will create a framework to scope out data science projects.  This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Part A.  Evaluate the following problem statement:

> "Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last login, and activity score: `1 = active user`, `0 = inactive user`), based on Hooli data from January-April 2015."

> ### Question 1.  What is the outcome?

Answer: Whether a free-tier customer will convert to a paying customer (Yes or No)

> ### Question 2.  What are the predictors/covariates?

Answer: Age, gender, location, profession, days since last login, and activity score

> ### Question 3.  What timeframe is this data relevant for?

Answer: January-April 2015

> ### Question 4.  What is the hypothesis?

Answer: Demographic information (age, gender, location, and profession) and usage information (days since last login and activity score) will allow us to predict whether a free-tier customer will convert to a paying customer.

## Part B.  Let's start exploring our UCLA dataset and answer some simple questions:

```python
import os
import pandas as pd

df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))

df.head()
```

Unnamed: 0 | admit | gre | gpa | prestige
---|---|---|---|---
0 | 0 | 380.0 | 3.61 | 3.0
1 | 1 | 660.0 | 3.67 | 3.0
2 | 1 | 800.0 | 4.0 | 1.0
3 | 1 | 640.0 | 3.19 | 4.0
4 | 0 | 520.0 | 2.93 | 4.0


> ### Question 5.  Create a data dictionary.

Answer:

Variable | Description | Type of Variable
---|---|---
admit | 0 = not admitted, 1 = admitted | Categorical (binary)
gre | GRE test score | Continuous
gpa | Undergraduate GPA | Continuous
prestige | Undergraduate school prestige | Categorical (ordinal)
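
One way to support the "Type of Variable" column is to look at each column's dtype, number of distinct values, and missing count. The sketch below is illustrative only: `sample` is a hand-typed copy of the five rows printed by `df.head()` above (not the full dataset), and the name `summary` is introduced here.

```python
import pandas as pd

# Hand-typed copy of the five rows shown by df.head(); not the full dataset
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# One row per variable: storage type, distinct values, missing values
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "n_missing": sample.isna().sum(),
})
print(summary)
```

A column with only a handful of distinct values (such as `admit` or `prestige`) is a candidate for a categorical type even when it is stored as a number.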

We would like to explore the association of GRE score, GPA, and school prestige with the probability of admission.

> ### Question 6.  What is the outcome?

Answer: Whether a student was admitted to UCLA (Yes or No)

> ### Question 7.  What are the predictors/covariates?

Answer: GRE score, GPA, and undergraduate school prestige

> ### Question 8.  What timeframe is this data relevant for?

Answer: The period over which the admissions data was collected. The columns shown above do not include a date, so the time frame has to come from the data source.

> ### Question 9.  What is the hypothesis?

> ### Question 10.  What's the problem statement?

> Using your answers to the above questions, write a well-formed problem statement.

Answer: Determine whether a student's chance of being admitted to UCLA graduate school is associated with their GRE score, GPA, and undergraduate school prestige.

## Part C.  Create an exploratory analysis plan by answering the following questions:

Because the answers to these questions haven't been covered in class yet, this section is optional.  This is by design: guessing or looking around for the answers will help the material make sense once we cover it in class.  You will not be penalized for wrong answers, but we encourage you to give it a try!

> ### Question 11. What are the goals of the exploratory analysis?

Answer: The goal of the exploratory analysis is to summarize the main characteristics of the UCLA dataset, primarily with visual methods. In particular, it investigates the distributions of GRE score, GPA, undergraduate school prestige, and admission rate, and identifies which variable transformations are needed before fitting a logistic regression model.

> ### Question 12.  What are the assumptions of the distribution of data?

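Answer: If we apply logistic regression here, we do not need many of the key assumptions of linear regression and other general linear models fitted by ordinary least squares: a linear relationship between predictors and outcome, normally distributed errors, homoscedasticity, and a continuous (interval-level) outcome. Logistic regression does still assume independent observations, a linear relationship between the continuous predictors and the log-odds of the outcome, and little collinearity among the predictors.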
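
For reference, a logistic regression for this dataset models the log-odds of admission, rather than admission itself, as a linear function of the predictors. The form below is a sketch: the coefficient symbols $\beta_0, \dots, \beta_3$ are notation introduced here, and entering prestige as a single numeric term is a simplifying assumption (it could instead be dummy-coded).

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\mathrm{gre} + \beta_2\,\mathrm{gpa} + \beta_3\,\mathrm{prestige},
\qquad p = P(\mathrm{admit} = 1)
$$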

> ### Question 13.  How will you determine the distribution of your data?

Answer: Visualize the data with histograms, boxplots, bivariate scatterplots, or even 3-D scatterplots to understand the univariate distribution of each predictor and its relationship with the outcome.
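
Numeric summaries are a useful companion to the plots. The sketch below is illustrative only: `sample` is a hand-typed copy of the five rows printed by `df.head()` above, so the numbers say nothing about the full dataset, and the names `desc` and `admit_rate` are introduced here.

```python
import pandas as pd

# Hand-typed copy of the five rows shown by df.head(); not the full dataset
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Univariate summaries of the continuous predictors (count, mean, std, quartiles)
desc = sample[["gre", "gpa"]].describe()

# Relationship between a categorical predictor and the outcome:
# the mean of a 0/1 column is the share of 1s, i.e. the admission rate
admit_rate = sample.groupby("prestige")["admit"].mean()

print(desc)
print(admit_rate)
```

On the full `df`, the same calls apply unchanged, and `df.hist()` (which needs matplotlib) would draw one histogram per numeric column.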

> ### Question 14.  How might outliers impact your analysis?

Answer: A logistic regression model is fitted by maximum likelihood, so it can be very sensitive to bad data, which is common in observational studies. Outliers are observations that are well separated from the rest of the data, such as an extremely low or high GPA or GRE score. Such cases may have large residuals and can have a dramatic effect on the fitted linear predictor. It is important to study outliers carefully and decide whether to retain or remove them and, if they are retained, whether to reduce their influence in the fitting process or to revise the logistic regression model.

> ### Question 15.  How will you test for outliers?

Answer: A straightforward way to check for outliers is a boxplot in matplotlib, which by default marks points lying more than 1.5 × IQR below the 25th percentile or above the 75th percentile of the variable. Another way is a histogram or density plot, which shows whether there are infrequent cases in the two tails of the distribution.
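
The 1.5 × IQR rule can also be applied directly, without a plot. This is a minimal sketch: the helper `iqr_outliers` is a hypothetical name introduced here, `gre` holds only the five GRE scores printed by `df.head()` above, and `toy` contains made-up values used solely to show a flagged point.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# The five GRE scores visible in df.head(); not the full dataset
gre = pd.Series([380.0, 660.0, 800.0, 640.0, 520.0], name="gre")
gre_outliers = iqr_outliers(gre)

# Made-up values, used only to show what a flagged point looks like
toy = pd.Series([1, 2, 3, 4, 100])
toy_outliers = iqr_outliers(toy)

print(gre_outliers)
print(toy_outliers)
```

Returning the flagged rows, rather than dropping them, keeps the decision to retain or remove each case a separate, documented step.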

> ### Question 16.  What is collinearity?

Answer: Collinearity refers to predictors that are correlated with other predictors.  It occurs when a regression model includes multiple factors that are correlated not just with the response variable but also with each other. In other words, it arises when some of the factors are partly redundant.
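
A simple first check for collinearity is the pairwise correlation matrix of the predictors. The sketch below is illustrative only: `predictors` is a hand-typed copy of the predictor columns from the five rows printed by `df.head()` above (far too few rows to draw conclusions), and the 0.8 cutoff is an arbitrary threshold assumed here, not a rule from the course.

```python
import pandas as pd

# Predictor columns from the five rows shown by df.head(); not the full dataset
predictors = pd.DataFrame({
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Pairwise Pearson correlations between predictors
corr = predictors.corr()

# Flag each pair once whose absolute correlation exceeds the assumed cutoff
threshold = 0.8
cols = list(corr.columns)
flagged = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr)
print(flagged)
```

Pairwise correlations only catch redundancy between two predictors at a time; redundancy spread across several predictors needs other diagnostics.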

> ### Question 17.  How will you test for covariance?

Answer: To test for covariance, I need to measure the joint variability of two random variables. A good starting point is a scatterplot of two continuous variables, or an interaction plot for categorical variables. Using the UCLA data as an example, if higher GPAs mainly correspond to higher GRE scores and lower GPAs to lower GRE scores, i.e., the variables tend to move together, the covariance is positive. In the opposite case, when higher GPAs mainly correspond to lower GRE scores, i.e., the variables tend to move in opposite directions, the covariance is negative. The sign of the covariance therefore shows the direction of the linear relationship between GRE and GPA.
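
The covariance itself can be computed alongside the scatterplot. This sketch is illustrative only: `sample` is a hand-typed copy of the GRE and GPA values from the five rows printed by `df.head()` above, so the sign it reports describes those five rows and not the full dataset.

```python
import pandas as pd

# GRE and GPA from the five rows shown by df.head(); not the full dataset
sample = pd.DataFrame({
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
})

# Sample covariance by hand: centre each variable, multiply, sum, divide by n - 1
gre_dev = sample["gre"] - sample["gre"].mean()
gpa_dev = sample["gpa"] - sample["gpa"].mean()
cov_manual = (gre_dev * gpa_dev).sum() / (len(sample) - 1)

# The same quantity from pandas, plus the full covariance matrix
cov_pandas = sample["gre"].cov(sample["gpa"])
cov_matrix = sample.cov()

direction = "positive" if cov_pandas > 0 else "negative or zero"
print(cov_manual, cov_pandas, direction)
```

Because covariance depends on the units of both variables, its sign is easier to interpret than its magnitude; the correlation coefficient rescales it to the range -1 to 1.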

> ### Question 18.  What is your exploratory analysis plan?

> Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now.

Answer:

1. Clearly define the time frame of the study; in this case, which academic years of admission data are used.
2. Build a comprehensive data dictionary that explains which features are available and which outcome I am trying to predict.
3. State clearly that this is an observational study, which can uncover only association, not causation, between the predictors and the outcome.
4. Use logistic regression as the main methodology, with stepwise selection to arrive at the final model.
5. Split the data into training and test sets using a 70:30 rule, and fix a random seed so that my colleague obtains the same training and validation datasets as I did.
6. Validate the model using metrics such as AUC or F1 score, then document all findings and steps in a well-written document so that my colleague can follow each step one year from now.
7. Save the development dataset in a secure, accessible location so that my colleague can get to the data whenever needed.
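
The 70:30 split with a fixed random seed described above could be sketched as follows. The helper `split_train_test`, the seed value 42, and the placeholder frame `data` are all assumptions introduced for illustration; in practice the function would be called on the full `df`, which must have a unique index for `drop` to work as intended.

```python
import pandas as pd

def split_train_test(data: pd.DataFrame, train_frac: float = 0.7, seed: int = 42):
    """Randomly split rows into train and test sets, reproducibly via the seed."""
    train = data.sample(frac=train_frac, random_state=seed)
    test = data.drop(train.index)  # every row not drawn into the training set
    return train, test

# Placeholder frame with ten rows, standing in for the real dataset
data = pd.DataFrame({"row_id": range(10)})

train, test = split_train_test(data)
train_again, _ = split_train_test(data)

print(len(train), len(test))
```

Recording the seed and the split fraction in the analysis document is what lets a colleague regenerate exactly the same training and test rows later.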