# DS-NYC-45 | Unit Project 1: Research Design Write-Up

In this first unit project you will create a framework to scope out data science projects.  This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Part A.  Evaluate the following problem statement:

> "Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last log in, and `activity score 1 = active user`, `0 = inactive user`) based on Hooli data from January-April 2015."

> ### Question 1.  What is the outcome?

Answer: Conversion indicator: whether a free-tier customer converts to a paying customer (yes or no)

> ### Question 2.  What are the predictors/covariates?

Answer: age, gender, location, profession, days since last log in, activity score

> ### Question 3.  What timeframe is this data relevant for?

Answer: January-April 2015

> ### Question 4.  What is the hypothesis?

Answer: Demographic data and customer usage data will allow us to predict if a free-tier customer will convert to a paying customer.

## Part B.  Let's start exploring our UCLA dataset and answer some simple questions:

In [3]:
import os
import pandas as pd

df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))

df.head()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |


> ### Question 5.  Create a data dictionary.


Variable | Description | Type of Variable
---|---|---
admit | whether a candidate was admitted to UCLA (admit = 1, not admit = 0)| Categorical
gre | GRE score | Continuous
gpa | GPA | Continuous
prestige | prestige of an applicant's alma mater (1, 2, 3, 4, in descending order of prestige) | Categorical
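
A data dictionary is easier to trust when it is checked against the data itself. The sketch below is one possible check, not part of the original notebook: it builds a toy frame from the five rows shown by `df.head()` plus one hypothetical row with a missing GRE score (an assumption, added only to exercise the null count), then tabulates each column's dtype, missing values, and distinct values.

```python
import pandas as pd

# Toy data: the first five rows are the df.head() output above.
# The sixth row is HYPOTHETICAL and exists only to show the null check.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0, 1],
    "gre":      [380.0, 660.0, 800.0, 640.0, 520.0, float("nan")],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, 3.50],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0, 2.0],
})

# One row per variable: storage type, missing values, distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isnull().sum(),
    "n_unique": toy.nunique(),
})
print(summary)
```

Running the same three calls on the real `df` would show whether `admit` and `prestige` really take only the values listed in the dictionary, and whether any column has gaps.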


We would like to explore the association between admit and gre, admit and gpa, and admit and prestige.

> ### Question 6.  What is the outcome?

In [6]:
df.corr(method='pearson')

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| admit | 1.0 | 0.182919 | 0.175952 | -0.241355 |
| gre | 0.182919 | 1.0 | 0.382408 | -0.124533 |
| gpa | 0.175952 | 0.382408 | 1.0 | -0.059031 |
| prestige | -0.241355 | -0.124533 | -0.059031 | 1.0 |


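Answer: The outcome is `admit`. The Pearson correlations above show how it relates to each candidate predictor; all three associations are weak (|r| < 0.25).

Positive correlation between `gre` and `admit` (r ≈ 0.18): a higher GRE score is associated with a greater likelihood of admission.

Positive correlation between `gpa` and `admit` (r ≈ 0.18): a higher GPA is associated with a greater likelihood of admission.

Negative correlation between `prestige` and `admit` (r ≈ -0.24): because 1 denotes the most prestigious tier, a more prestigious alma mater (a lower `prestige` number) is associated with a greater likelihood of admission.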
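
Because `admit` is binary, the same associations can also be read as group averages, which some readers find easier to interpret than a correlation coefficient. The sketch below assumes a toy frame: the first five rows are the `df.head()` output and the last three are hypothetical rows added for illustration, so the numbers it prints say nothing about the real data set.

```python
import pandas as pd

# First five rows: df.head() output. Last three rows: HYPOTHETICAL.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0, 1, 0, 0],
    "gre":      [380.0, 660.0, 800.0, 640.0, 520.0, 700.0, 500.0, 480.0],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, 3.80, 3.00, 3.10],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0, 1.0, 2.0, 4.0],
})

# Share of applicants admitted within each prestige tier.
admit_rate_by_prestige = toy.groupby("prestige")["admit"].mean()

# Average GRE and GPA for rejected (0) and admitted (1) applicants.
means_by_admit = toy.groupby("admit")[["gre", "gpa"]].mean()

print(admit_rate_by_prestige)
print(means_by_admit)
```

Replacing `toy` with the real `df` gives the admission rate per prestige tier and the mean scores per outcome.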

> ### Question 7.  What are the predictors/covariates?

Answer:

GRE, GPA, prestige

> ### Question 8.  What timeframe is this data relevant for?

Answer: The data set doesn't include timeframe information.

> ### Question 9.  What is the hypothesis?

Answer: Among applicants included in the data set, those with a higher GPA, a higher GRE score, and/or a more prestigious alma mater were more likely to be admitted to UCLA.

> ### Question 10.  What's the problem statement?

> Using your answers to the above questions, write a well-formed problem statement.

Answer: Determine the association between admission to UCLA and the following factors: GRE, GPA, and prestige of applicant alma mater. Using cross-sectional data from the UCLA admissions data set, we will determine the factors associated with admission rate. We will test if GRE, GPA, or prestige of applicant alma mater increased the likelihood that an applicant would be admitted to UCLA. 

## Part C.  Create an exploratory analysis plan by answering the following questions:

Because the answers to these questions haven't been covered in class yet, this section is optional.  This is by design.  Guessing or looking around for these answers will help the material make sense once we cover it in class.  You will not be penalized for wrong answers, but we encourage you to give it a try!

> ### Question 11. What are the goals of the exploratory analysis?

Answer: Understand the distribution of data and any associations between variables.

> ### Question 12.  What are the assumptions of the distribution of data?

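Answer: Assume GRE and GPA are approximately normally distributed (with most students being "average"). Prestige is an ordinal variable with only four levels, so it is better described by the frequency of each level than by a normal distribution. We don't have context for where this data set comes from, though, which makes these assumptions harder to justify.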

> ### Question 13.  How will you determine the distribution of your data?

Answer: We can plot the data (for example, a histogram of each variable) to visualize any larger trends. We can also run specific tests to determine how well a given distribution fits.
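
A rough numeric version of this check can be sketched with pandas alone. The example below uses simulated scores rather than the admissions data: the mean of 580 and standard deviation of 115 are assumptions chosen only to give GRE-like numbers. It prints summary statistics, skewness, excess kurtosis, and binned counts (a text histogram). Skewness and excess kurtosis near zero are consistent with a normal shape; a formal test such as `scipy.stats.shapiro` could follow.

```python
import numpy as np
import pandas as pd

# SIMULATED scores; loc and scale are assumptions, not estimates from the data.
rng = np.random.default_rng(0)
gre = pd.Series(rng.normal(loc=580, scale=115, size=400), name="gre")

print(gre.describe())

# Both statistics are close to 0 for normally distributed data.
skewness = gre.skew()
excess_kurtosis = gre.kurt()
print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")

# Counts per equal-width bin, in bin order: a histogram without plotting.
bin_counts = pd.cut(gre, bins=8).value_counts(sort=False)
print(bin_counts)
```

On the real data, `df['gre']` and `df['gpa']` would take the place of the simulated series.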


> ### Question 14.  How might outliers impact your analysis?

Answer: Outliers can skew results of analysis because they are not representative of general trends.

> ### Question 15.  How will you test for outliers?

Answer: Determine standard deviations, choose cutoff, view any data outside that cutoff.
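
The cutoff rule can be sketched as a small helper. The function name `zscore_outliers`, the default cutoff of 3 standard deviations, and the sample values are all assumptions made for illustration; the right cutoff depends on the data.

```python
import pandas as pd

def zscore_outliers(s, cutoff=3.0):
    """Return the values of s lying more than `cutoff` standard deviations from the mean."""
    z = (s - s.mean()) / s.std()
    return s[z.abs() > cutoff]

# HYPOTHETICAL scores: 19 evenly spaced values plus one unusually low score.
scores = pd.Series(list(range(500, 690, 10)) + [220], name="gre")

flagged = zscore_outliers(scores)
print(flagged)
```

With small samples a single extreme value inflates the standard deviation and can hide itself, so it is worth viewing flagged rows before deciding whether to drop them.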

> ### Question 16.  What is colinearity?

Answer: Colinearity describes the situation when two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
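
One common way to quantify this is the variance inflation factor (VIF), which is not mentioned in the original answer and is offered here only as an illustration. For standardized predictors, the VIFs are the diagonal entries of the inverse of the predictor correlation matrix. The sketch uses simulated predictors `x1`, `x2`, `x3` (hypothetical names), where `x2` is built to be nearly a copy of `x1`.

```python
import numpy as np
import pandas as pd

# SIMULATED predictors: x2 is x1 plus a little noise, x3 is independent.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=n),
    "x3": rng.normal(size=n),
})

# Pairwise correlations among the predictors.
corr = X.corr()
print(corr.round(2))

# VIF_j is the j-th diagonal entry of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=corr.columns)
print(vif.round(1))
```

A VIF near 1 indicates little collinearity; values well above 5 or 10 are often treated as a warning sign, though those thresholds are rules of thumb.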

> ### Question 17.  How will you test for covariance?

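Answer: Compute the pairwise covariance (or the scale-free correlation) between the variables. An ANCOVA model could then be used to compare regression lines across groups, such as the prestige levels.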
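
Pairwise covariance can also be inspected directly with pandas. The sketch below uses only the five rows shown by `df.head()`, so its numbers are illustrative rather than results for the full data set. It also checks the identity that links covariance to the correlation matrix computed earlier: cov(x, y) = corr(x, y) · sd(x) · sd(y).

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(); too few rows for real conclusions.
toy = pd.DataFrame({
    "gre":      [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

cov = toy.cov()    # sample covariance, in the units of the two variables
corr = toy.corr()  # Pearson correlation, unit-free
std = toy.std()

# Rebuild one covariance entry from the correlation and standard deviations.
rebuilt = corr.loc["gre", "gpa"] * std["gre"] * std["gpa"]
print(cov.round(3))
print(np.isclose(cov.loc["gre", "gpa"], rebuilt))
```

Covariance depends on each variable's scale (GRE points versus GPA points), which is why the correlation matrix is usually easier to compare across pairs.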

> ### Question 18.  What is your exploratory analysis plan?

> Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now.

Answer: Plot the data and look for broad trends. Flag outliers using a standard-deviation cutoff and remove them. Fit a regression line for each predictor (GRE, GPA, prestige) against `admit`. Test for covariance among the predictors.