# DS-NYC-45 | Unit Project 1: Research Design Write-Up

In this first unit project you will create a framework to scope out data science projects.  This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Part A.  Evaluate the following problem statement:

> "Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last log in, and `activity score 1 = active user`, `0 = inactive user`) based on Hooli data from January-April 2015."

> ### Question 1.  What is the outcome?

Answer: An indicator of whether a free-tier customer converts to a paying customer (yes or no)

> ### Question 2.  What are the predictors/covariates?

Answer: Age, gender, location, profession, days since last log in, and activity score (1 = active, 0 = inactive)

> ### Question 3.  What timeframe is this data relevant for?

Answer: Jan - Apr 2015

> ### Question 4.  What is the hypothesis?

Answer: Demographic and past customer usage data are indicators of how likely a free-tier customer is to convert to the paying tier.

## Part B.  Let's start exploring our UCLA dataset and answer some simple questions:

```python
import os
import pandas as pd

df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))

df.head()
```

Unnamed: 0 | admit | gre | gpa | prestige
---|---|---|---|---
0 | 0 | 380.0 | 3.61 | 3.0
1 | 1 | 660.0 | 3.67 | 3.0
2 | 1 | 800.0 | 4.0 | 1.0
3 | 1 | 640.0 | 3.19 | 4.0
4 | 0 | 520.0 | 2.93 | 4.0


> ### Question 5.  Create a data dictionary.

Answer: (Use the template below)

Variable | Description | Type of Variable
---|---|---
admit | 0 = not admitted, 1 = admitted | Categorical
gre | GRE (Graduate Record Examination) score | Continuous
gpa | Grade point average | Continuous
prestige | 1 to 5 from least to most prestigious | Categorical
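
One way to check a data dictionary against the data is to look at the column types pandas inferred and at the distinct values of the columns treated as categorical. The sketch below is only an illustration: it uses the five rows printed by `df.head()` above as stand-in data, and the names `sample` and `categorical_values` are ours. On the real file the same calls would be run on `df`.

```python
import pandas as pd

# Stand-in data: the five rows printed by df.head() above
sample = pd.DataFrame({
    'admit': [0, 1, 1, 1, 0],
    'gre': [380.0, 660.0, 800.0, 640.0, 520.0],
    'gpa': [3.61, 3.67, 4.0, 3.19, 2.93],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Storage type pandas inferred for each column
print(sample.dtypes)

# Distinct values of the columns the dictionary lists as categorical
categorical_values = {
    column: sorted(sample[column].dropna().unique().tolist())
    for column in ['admit', 'prestige']
}
print(categorical_values)
```

Running the second check on the full dataset shows which prestige levels actually occur, which is a quick way to confirm the scale written in the dictionary.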

We would like to explore the association between X and Y.

> ### Question 6.  What is the outcome?

Answer: Admission indicator (yes or no)

> ### Question 7.  What are the predictors/covariates?

Answer: GRE, GPA, prestigiousness of school

> ### Question 8.  What timeframe is this data relevant for?

Answer: Ideally, we would look at the most recent 10 years of data, to ensure sufficient data points and to account for any macro-level fluctuations that could affect admission decisions. If available, I would use data on students who applied between 2006 and 2016.

> ### Question 9.  What is the hypothesis?

Answer: A student's GRE score, GPA, and the prestige of their school will allow us to predict whether or not they will be admitted to the program.

> ### Question 10.  What's the problem statement?

> Using your answers to the above questions, write a well-formed problem statement.

Answer: Determine which students will be admitted, using their GRE scores, their GPAs, and the prestige level of their school, based on data from the UCLA database for students who applied between 2006 and 2016 (10 years).

## Part C.  Create an exploratory analysis plan by answering the following questions:

Because the answers to these questions haven't been covered in class yet, this section is optional.  This is by design: guessing or looking around for these answers will help the material make sense once we cover it in class.  You will not be penalized for wrong answers, but we encourage you to give it a try!

> ### Question 11. What are the goals of the exploratory analysis?

Answer: To get a general understanding of what the data looks like (i.e. what the variables are, descriptive statistics on the variables, how large the data set is, and how much of the data is missing) and to do an overall sanity check (e.g. whether the data imported correctly and whether data points lie within the range of allowable values).
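
A minimal sketch of these checks in pandas is shown below. It uses the five rows printed by `df.head()` in Part B as stand-in data. The allowable ranges in `allowed_ranges` are assumptions made for illustration and are not taken from the dataset's documentation.

```python
import pandas as pd

# Stand-in data: the five rows printed by df.head() in Part B
sample = pd.DataFrame({
    'admit': [0, 1, 1, 1, 0],
    'gre': [380.0, 660.0, 800.0, 640.0, 520.0],
    'gpa': [3.61, 3.67, 4.0, 3.19, 2.93],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0],
})

# How large is the data set?
n_rows, n_columns = sample.shape
print(n_rows, n_columns)

# How much of the data is missing, per column?
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Sanity check: count non-missing values outside the allowable range.
# These bounds are assumptions, not documented limits.
allowed_ranges = {'admit': (0, 1), 'gre': (200, 800), 'gpa': (0.0, 4.0)}
out_of_range = {}
for column, (low, high) in allowed_ranges.items():
    values = sample[column].dropna()
    out_of_range[column] = int((~values.between(low, high)).sum())
print(out_of_range)
```

Any non-zero count in `out_of_range` or `missing_per_column` is a prompt to go back and look at how the data was collected or imported before doing further analysis.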

> ### Question 12.  What are the assumptions of the distribution of data?

Answer: Typically, you might assume that continuous variables are normally distributed. For GRE score and GPA, I would assume that they are approximately normally distributed. For categorical variables, you would like the observations to be spread roughly evenly across the categories.

> ### Question 13.  How will you determine the distribution of your data?

Answer: Create a histogram of each variable.
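
As a rough illustration, the counts behind a histogram can be computed with `numpy.histogram`. The sketch below uses the five GRE scores printed by `df.head()` in Part B, and the bin edges are an assumption chosen for readability.

```python
import numpy as np
import pandas as pd

# Stand-in data: the GRE column from the five rows printed by df.head()
gre = pd.Series([380.0, 660.0, 800.0, 640.0, 520.0], name='gre')

# Bin edges are an assumption chosen for illustration.
# np.histogram treats these as [200, 400), [400, 600), [600, 800].
bin_edges = [200, 400, 600, 800]
counts, edges = np.histogram(gre, bins=bin_edges)

# Print a simple text histogram, one row per bin
for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f'{int(low)}-{int(high)}: {"#" * int(count)}')
```

In a notebook, `df.hist()` draws the equivalent plots for every numeric column, provided matplotlib is installed.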

> ### Question 14.  How might outliers impact your analysis?

Answer: Outliers skew the distribution of the data. They pull the mean away from values that are typical of the rest of the data and inflate the standard deviation.
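
A small sketch of this effect is shown below. It takes the five GPA values printed by `df.head()` in Part B and then replaces one of them with a hypothetical data-entry error (40.0 typed instead of 4.0). The erroneous value is invented for illustration.

```python
import pandas as pd

# GPA values from the five rows printed by df.head()
gpa = pd.Series([3.61, 3.67, 4.0, 3.19, 2.93])

# Hypothetical data-entry error: 40.0 typed instead of 4.0
gpa_with_outlier = pd.Series([3.61, 3.67, 40.0, 3.19, 2.93])

# The mean moves a long way; the median does not move at all
print(gpa.mean(), gpa.median())
print(gpa_with_outlier.mean(), gpa_with_outlier.median())
```

A single bad value moves the mean from about 3.48 to about 10.68, while the median stays at 3.61. Comparing the mean and median is therefore a quick first signal that outliers may be present.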

> ### Question 15.  How will you test for outliers?

Answer: Create dot or scatter plots of each variable to look at the overall distribution of the data and check whether there are any clear outliers. You could also use the interquartile range (IQR) method or the z-score method to look for outliers quantitatively.
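
Both quantitative methods can be sketched in a few lines of pandas. The scores below are hypothetical, with one implausible value (1600) added on purpose. The 1.5 x IQR fences are the conventional choice, and the z-score cutoff of 2 is an assumption chosen because the sample is tiny.

```python
import pandas as pd

# Hypothetical GRE scores; 1600 is an implausible value added for illustration
gre = pd.Series([380, 660, 800, 640, 520, 580, 700, 1600], dtype=float)

# IQR method: flag values more than 1.5 * IQR beyond the quartiles
q1 = gre.quantile(0.25)
q3 = gre.quantile(0.75)
iqr = q3 - q1
iqr_outliers = gre[(gre < q1 - 1.5 * iqr) | (gre > q3 + 1.5 * iqr)]

# z-score method: flag values far from the mean in standard-deviation units.
# A cutoff of 2 is an assumption for this tiny sample; 3 is common on larger data.
z_scores = (gre - gre.mean()) / gre.std()
z_outliers = gre[z_scores.abs() > 2]

print(iqr_outliers.tolist())
print(z_outliers.tolist())
```

The outlier itself inflates the mean and standard deviation that the z-score relies on, so on small samples the IQR method tends to be the more dependable of the two.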

> ### Question 16.  What is collinearity?

Answer: Collinearity is when two or more features in the dataset are highly correlated, which means one can be predicted accurately from the other and hence might be redundant as a predictor.

> ### Question 17.  How will you test for covariance?

Answer: Print the correlation matrix of the data and check whether any two covariates have a correlation close to 1 in absolute value.
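
A minimal sketch of this check is shown below, again using the five rows printed by `df.head()` in Part B as stand-in data. The 0.8 cutoff in `threshold` is an assumption; the right value depends on the analysis.

```python
import pandas as pd

# Stand-in data: the predictor columns from the five rows printed by df.head()
predictors = pd.DataFrame({
    'gre': [380.0, 660.0, 800.0, 640.0, 520.0],
    'gpa': [3.61, 3.67, 4.0, 3.19, 2.93],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Pairwise Pearson correlations between the predictors
corr = predictors.corr()
print(corr)

# List each pair of distinct predictors whose |correlation| reaches the cutoff.
# The 0.8 cutoff is an assumption chosen for illustration.
threshold = 0.8
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]
print(high_pairs)
```

Five rows are far too few to draw any conclusion about the real data. The point is only the mechanics: compute `corr()` once, then scan the part of the matrix above the diagonal so each pair is reported a single time.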

> ### Question 18.  What is your exploratory analysis plan?

> Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now.

Answer:
1) Look at all variables and the number of observations, and get a sense of how many data points are missing
2) Create histograms for each variable in the dataset to understand the distribution of the data
3) Print general descriptive statistics (e.g. mean, median, mode, quartiles)
4) Test for outliers using the IQR method or the z-score method
5) Calculate the covariance and correlation matrices to understand the relationship between the covariates, if any. If there are variables that are highly correlated, we would need to decide whether to omit one of them.
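
The descriptive statistics step of this plan can be sketched with pandas' built-in summaries. The example below uses the five rows printed by `df.head()` in Part B as stand-in data; on the real file the same calls would be made on `df`.

```python
import pandas as pd

# Stand-in data: the five rows printed by df.head() in Part B
sample = pd.DataFrame({
    'admit': [0, 1, 1, 1, 0],
    'gre': [380.0, 660.0, 800.0, 640.0, 520.0],
    'gpa': [3.61, 3.67, 4.0, 3.19, 2.93],
    'prestige': [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Count, mean, standard deviation, min, quartiles and max for continuous columns
summary = sample[['gre', 'gpa']].describe()
print(summary)

# The mode suits categorical columns; ties are all returned
prestige_mode = sample['prestige'].mode()
print(prestige_mode.tolist())
```

Saving the code for each step in the notebook alongside the plan is what lets a colleague rerun the same analysis a year from now.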