# Project 1

In this first project you will create a framework to scope out data science projects. This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

### Read and evaluate the following problem statement: 
Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last login, and an activity score where `1 = active user` and `0 = inactive user`), based on Hooli data from Jan-Apr 2015. 


#### 1. What is the outcome?

Answer: Whether a free-tier customer converts to a paying customer (yes/no)

#### 2. What are the predictors/covariates? 

Answer: Age, gender, location, profession, days since last log in, activity score

#### 3. What timeframe is this data relevant for?

Answer: January-April 2015 

#### 4. What is the hypothesis?

Answer: Demographic and prior usage data will predict whether a free-tier customer converts to a paying customer.

## Let's get started with our dataset

In [1]:
import os
import numpy as np
import pandas as pd

# If you checked-out the GitHub repository, the UCLA dataset is under ../assets/admissions.csv (relative to this file)
path = os.path.join('..', 'assets', 'admissions.csv')

# Alternative: get the dataset directly online (uncomment the next line).
# path = 'http://github.com/ga-students/sf-dat-21/raw/master/unit-projects/dataset/admissions.csv'

df = pd.read_csv(path)

df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


In [5]:
df.describe()

Unnamed: 0,admit,gre,gpa,prestige
count,400.0,398.0,398.0,399.0
mean,0.3175,588.040201,3.39093,2.486216
std,0.466087,115.628513,0.38063,0.945333
min,0.0,220.0,2.26,1.0
25%,0.0,520.0,3.13,2.0
50%,0.0,580.0,3.395,2.0
75%,1.0,660.0,3.67,3.0
max,1.0,800.0,4.0,4.0


#### 1. Create a data dictionary 

Answer (Use the template below): 

Variable | Description | Type of Variable
---| ---| ---
Var 1 | 0 = not thing 1 = thing | categorical
Var 2 | thing in unit X | continuous 


In [20]:
df2 = pd.DataFrame([{'Variable' : 'admit',
                         'Description' : '0 = not admitted 1 = admitted',
                         'Type of Variable' : 'categorical'}, 
                    
                    {'Variable' : 'gre',
                         'Description' : 'GRE score',
                         'Type of Variable' : 'continuous'},
                    
                    {'Variable' : 'gpa',
                         'Description' : 'GPA',
                         'Type of Variable' : 'continuous'},
                    
                    # prestige appears to take only the values 1 to 4 (see df.describe()),
                    # so it is better treated as an ordinal category than as continuous
                    {'Variable' : 'prestige',
                         'Description' : 'prestige score (1 to 4)',
                         'Type of Variable' : 'categorical (ordinal)'},
                   
                   ])

df2.set_index(['Variable'])

Variable,Description,Type of Variable
admit,0 = not admitted 1 = admitted,categorical
gre,GRE score,continuous
gpa,GPA,continuous
prestige,prestige score (1 to 4),categorical (ordinal)


We would like to explore the association between admission (Y) and GRE score, GPA, and prestige (X).

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
admit       400 non-null int64
gre         398 non-null float64
gpa         398 non-null float64
prestige    399 non-null float64
dtypes: float64(3), int64(1)
memory usage: 12.6 KB


In [9]:
df.dtypes

admit         int64
gre         float64
gpa         float64
prestige    float64
dtype: object

#### 2. What is the outcome?

Answer: Student admission (Yes/No)

#### 3. What are the predictors/covariates? 

Answer: GRE, GPA and prestige scores

#### 4. What timeframe is this data relevant for?

Answer: Not indicated in the dataset

#### 5. What is the hypothesis?

Answer: Students with higher GRE, GPA and prestige scores are more likely to be admitted to UCLA.

Using the above information, write a well-formed problem statement. 


## Problem Statement

### Exploratory Analysis Plan

Using the lab from class as a guide, create an exploratory analysis plan. 

#### 1. What are the goals of the exploratory analysis? 

Answer: To make sure the dataset is a reliable source, to learn about the variables, and to form a hypothesis.

#### 2a. What are the assumptions of the distribution of data? 

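Answer: We assume the continuous variables (GRE and GPA) are approximately normally distributed.

#### 2b. How will you determine the distribution of your data? 

Answer: Plot each variable (for example, as a histogram) and compare the mean and median. If they are close, the distribution is roughly symmetric, which is consistent with a normal distribution but does not prove it. As a further check, in a normal distribution about 68% of values fall within one standard deviation of the mean.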
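
The mean/median comparison can be sketched in pandas as follows. This is a minimal illustration, not part of the original notebook: the helper name `distribution_summary` is hypothetical, and the demo runs on seeded synthetic data rather than on `admissions.csv`. It also reports the share of values within one standard deviation of the mean, which is about 68% for normally distributed data.

```python
import numpy as np
import pandas as pd


def distribution_summary(series: pd.Series) -> dict:
    """Rough checks of how close a numeric column is to a normal shape.

    Hypothetical helper for illustration only.
    """
    values = series.dropna()
    mean, median, sd = values.mean(), values.median(), values.std()
    # Share of observations within one standard deviation of the mean
    share_within_1sd = ((values - mean).abs() <= sd).mean()
    return {
        "mean": mean,
        "median": median,
        "skew": values.skew(),  # near 0 for a symmetric distribution
        "share_within_1sd": share_within_1sd,
    }


# Synthetic, seeded data (an assumption for the demo, not the real dataset)
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=580, scale=115, size=10_000))
summary = distribution_summary(sample)
print(summary)
```

On the real data, `distribution_summary(df['gre'])` would apply the same checks. The `df.describe()` output above already shows the mean and median are close for both `gre` (588.04 vs. 580) and `gpa` (3.39 vs. 3.395).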

#### 3a. How might outliers impact your analysis? 

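Answer: Outliers can distort summary statistics such as the mean and standard deviation of GRE, GPA and prestige, and may bias any model fitted to the admissions data.

#### 3b. How will you test for outliers? 

Answer: Compare the mean and median: if they are close, extreme values are not pulling the mean strongly in one direction. A large gap suggests skew, which may be caused by outliers. Because this check is only a rough signal, a box plot can be used to flag individual extreme values.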
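
Comparing the mean and median only hints at outliers. A common complementary rule of thumb (an addition here, not part of the original answer) is the 1.5 x IQR rule that box plots use. The sketch below assumes a hypothetical helper `iqr_outliers` and made-up demo values.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles.

    Hypothetical helper for illustration only.
    """
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up scores for the demo (not taken from admissions.csv)
demo_scores = pd.Series([520, 580, 600, 620, 640, 660, 220])
flagged = iqr_outliers(demo_scores)
print(flagged)
```

Applied to the quartiles reported by `df.describe()` for `gre` (520 and 660, so IQR = 140), the fences would be 310 and 870, which means the minimum score of 220 would be flagged for a closer look.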

#### 4a. What is collinearity? 

Answer: Occurs when two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.

#### 4b. How will you test for collinearity? 

Answer: Compute the pairwise correlation coefficients between predictors with the corr() function. An absolute value close to 1 indicates collinearity.
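
A minimal sketch of this check with `DataFrame.corr()` follows. The helper name `high_correlations` and the 0.8 threshold are assumptions for illustration, and the demo uses only the five rows shown by `df.head()` above, so its numbers say nothing about the full dataset.

```python
import pandas as pd


def high_correlations(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List predictor pairs whose absolute Pearson correlation meets the threshold.

    Hypothetical helper; the 0.8 cutoff is an arbitrary illustrative choice.
    """
    corr = df.corr()
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            r = corr.loc[first, second]
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs


# Predictor columns from the five rows printed by df.head()
predictors = pd.DataFrame({
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})
print(predictors.corr())
print(high_correlations(predictors))
```

On the full data the call would be `high_correlations(df[['gre', 'gpa', 'prestige']])`. Using the absolute value matters because a correlation near -1 signals collinearity just as much as one near +1.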

#### 5. What is your exploratory analysis plan?
Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now. 

Answer:  
1) Create a data dictionary to understand the variables and how they are coded. 
2) Calculate summary statistics (mean, median, max, min, SD, etc.) -- df.describe()
3) Determine the distribution of the data by plotting it and comparing the mean and median.
4) Check for outliers by comparing the mean and median, and check for collinearity by calculating correlation coefficients with the corr() function.
5) Determine if there are missing values in the dataset and decide whether to keep or exclude the affected rows.
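
The missing-value step can be sketched as below. The helper name `missing_report` is hypothetical, and the demo frame reuses the five rows from `df.head()` plus one made-up row with a missing `gre` value, added purely to show the behavior.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column (hypothetical helper)."""
    return pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "share_missing": df.isnull().mean(),
    })


# First five rows match df.head(); the sixth row is invented for the demo
demo = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 1],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0, np.nan],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93, 3.5],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0, 2.0],
})
report = missing_report(demo)
complete_rows = demo.dropna()  # one option: keep only fully observed rows
print(report)
print(len(demo), "rows before,", len(complete_rows), "rows after dropping missing values")
```

For the real dataset, the counts in `df.describe()` (398 for `gre` and `gpa`, 399 for `prestige`, out of 400 rows) already indicate that a few values are missing. Dropping those rows is only one option and should be recorded in the plan so the analysis can be reproduced.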


## Bonus Questions:
1. Outline your analysis method for predicting your outcome
2. Write an alternative problem statement for your dataset
3. Articulate the assumptions and risks of the alternative model