# DS-SF-30 | Unit Project 1: Research Design

In this first unit project you will create a framework to scope out data science projects.  This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Part A.  Evaluate the following problem statement:

> "Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last log in, and `activity score 1 = active user`, `0 = inactive user`) based on Hooli data from January-April 2015."

> ### Question 1.  What is the outcome?

Answer: Free-tier customer to paying customer conversion indicator (**yes** / **no**)

> ### Question 2.  What are the predictors/covariates?

Answer: age, gender, location, profession, days since last login, activity score

> ### Question 3.  What timeframe is this data relevant for?

Answer: January 2015 - April 2015

> ### Question 4.  What is the hypothesis?

Answer: Demographic and usage data will allow us to predict if a free-tier customer will convert to a paying customer.

## Part B.  Let's start exploring our UCLA dataset and answer some simple questions:

In [1]:
import os
import pandas as pd

df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))

df.head()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |


> ### Question 5.  Create a data dictionary.

Answer:

Variable | Description | Type of Variable
---|---|---
admit | 0 = Not admitted, 1 = Admitted | Categorical
gre | GRE score | Continuous
gpa | GPA score | Continuous
prestige | Prestige of the applicant's alma mater: 1 = most prestigious, 4 = least prestigious | Categorical

We would like to explore the association between the predictors (gre, gpa, prestige) and the outcome (admit).

In [5]:
df.describe()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| count | 400.0 | 398.0 | 398.0 | 399.0 |
| mean | 0.3175 | 588.040201 | 3.39093 | 2.486216 |
| std | 0.466087 | 115.628513 | 0.38063 | 0.945333 |
| min | 0.0 | 220.0 | 2.26 | 1.0 |
| 25% | 0.0 | 520.0 | 3.13 | 2.0 |
| 50% | 0.0 | 580.0 | 3.395 | 2.0 |
| 75% | 1.0 | 660.0 | 3.67 | 3.0 |
| max | 1.0 | 800.0 | 4.0 | 4.0 |
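
The `count` row above differs by column (400 for admit, 398 for gre and gpa, 399 for prestige), which suggests a few missing values. A minimal sketch of how they could be counted and dropped follows. It uses the five rows shown by `df.head()` plus one hypothetical row with missing values, added here purely for illustration; the name `sample` is also an assumption.

```python
import numpy as np
import pandas as pd

# First five rows from df.head() above, plus one HYPOTHETICAL row
# (the last one) with missing gre and gpa, added only for illustration.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 1],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0, np.nan],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93, np.nan],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0, 2.0],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Keep only rows with no missing values.
complete = sample.dropna()
print(len(sample), "rows before,", len(complete), "rows after dropna()")
```

On the real data, the same two calls would be applied to `df` instead of `sample`.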


In [4]:
df.corr()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| admit | 1.0 | 0.182919 | 0.175952 | -0.241355 |
| gre | 0.182919 | 1.0 | 0.382408 | -0.124533 |
| gpa | 0.175952 | 0.382408 | 1.0 | -0.059031 |
| prestige | -0.241355 | -0.124533 | -0.059031 | 1.0 |


> ### Question 6.  What is the outcome?

Answer: Admittance indicator (**yes** or **no**)

> ### Question 7.  What are the predictors/covariates?

Answer: gre, gpa, prestige

> ### Question 8.  What timeframe is this data relevant for?

Answer:

> ### Question 9.  What is the hypothesis?

Answer: Past data on factors that affect admittance, such as gre, gpa, and prestige, will allow us to predict future admittance.

> ### Question 10.  What's the problem statement?

> Using your answers to the above questions, write a well-formed problem statement.

Answer: Determine whether a student will be admitted to UCLA, based on past admission data that includes each student's gre, gpa, and the prestige of their alma mater.

## Part C.  Create an exploratory analysis plan by answering the following questions:

Because the answers to these questions haven't been covered in class yet, this section is optional.  This is by design.  Guessing or looking around for these answers will help the material make sense once we cover it in class.  You will not be penalized for wrong answers, but we encourage you to give it a try!

> ### Question 11. What are the goals of the exploratory analysis?

Answer: The goal is to see how these individual variables (gre, gpa, prestige) are related to the binary response variable admit and if we can use existing data to predict admittance in the future.
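
One way to look at how each predictor relates to the binary outcome is to compare group summaries. The sketch below is illustrative only: it uses just the five rows shown by `df.head()`, so the numbers say nothing about the full dataset, and the variable names are assumptions.

```python
import pandas as pd

# The five rows shown by df.head() earlier in this notebook.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Mean gre and gpa for rejected (0) and admitted (1) applicants.
means_by_outcome = sample.groupby("admit")[["gre", "gpa"]].mean()
print(means_by_outcome)

# Share of admitted applicants within each prestige tier.
admit_rate_by_prestige = sample.groupby("prestige")["admit"].mean()
print(admit_rate_by_prestige)
```

Because `admit` is coded 0/1, its mean within a group is the admission rate for that group.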

> ### Question 12.  What are the assumptions of the distribution of data?

Answer: The assumption is that the continuous variables (gre, gpa) are approximately normally distributed, without too many outliers.

> ### Question 13.  How will you determine the distribution of your data?

Answer: We can plot it to see the distribution.
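
In a notebook, `df.hist()` draws one histogram per column (it requires matplotlib). As a plot-free sketch of the same idea, the values can be binned and the skewness computed. The five gre values below come from `df.head()`; the choice of three bins is an arbitrary assumption.

```python
import pandas as pd

# gre values from the five rows shown by df.head().
gre = pd.Series([380.0, 660.0, 800.0, 640.0, 520.0], name="gre")

# Count values in equal-width bins: a text version of a histogram.
# bins=3 is an arbitrary choice for this tiny sample.
bin_counts = gre.value_counts(bins=3, sort=False)
print(bin_counts)

# Sample skewness: values near 0 suggest a roughly symmetric distribution.
skewness = gre.skew()
print("skewness:", skewness)
```

With only five values this says little; on the full column the bin counts and skewness give a first indication of whether the distribution looks roughly symmetric.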

> ### Question 14.  How might outliers impact your analysis?

Answer: Outliers can distort summary statistics such as the mean, standard deviation, and correlation coefficients, which can lead to misleading conclusions.

> ### Question 15.  How will you test for outliers?

Answer: We can use box plots to see if there are any outliers.
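
Box plots conventionally draw points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below applies that rule directly; the helper name `iqr_outliers` and the toy values are hypothetical and not taken from the admissions data.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# HYPOTHETICAL toy values with one obvious extreme point.
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 50.0])
outliers = iqr_outliers(toy)
print(outliers)
```

Applying the same rule to the gre quartiles reported by `df.describe()` (520 and 660) gives fences at 310 and 870, so the minimum gre of 220 would be flagged as an outlier.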

> ### Question 16.  What is collinearity?

Answer: Collinearity refers to a high correlation between two or more predictor variables.

> ### Question 17.  How will you test for covariance?

Answer: We can use the correlation matrix to test for covariance.
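
In the correlation matrix computed earlier, the strongest relationship between predictors is gre and gpa at about 0.38, which is moderate. Correlation is covariance rescaled by the standard deviations of the two variables. The sketch below checks that relationship on the five rows shown by `df.head()`; the variable names are assumptions.

```python
import pandas as pd

# gre and gpa from the five rows shown by df.head().
sample = pd.DataFrame({
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
})

cov = sample.cov()    # pairwise sample covariance
corr = sample.corr()  # pairwise Pearson correlation

# Pearson correlation = covariance / (std of x * std of y).
manual_corr = cov.loc["gre", "gpa"] / (sample["gre"].std() * sample["gpa"].std())

print(cov)
print(corr)
print("manual correlation:", manual_corr)
```

Covariance depends on the units of each variable, so the unit-free correlation is usually easier to compare across pairs of variables.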

> ### Question 18.  What is your exploratory analysis plan?

> Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now.

Answer:
1. Take a look at some sample data to see if it needs any clean up.
2. Do some descriptive analysis on the data to understand it.
3. Plot the data to see if there are any outliers (box plot can help).
4. Generate a correlation matrix to analyze how the variables are related to each other.
5. Draw scatter plots to see how the variables are related to each other and to the outcome.