# Hooli Questions

### Read and evaluate the following problem statement: 
Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last login, and activity score: 1 = active user, 0 = inactive user), based on Hooli data from Jan-Apr 2015.


#### 1. What is the outcome?

Conversion to paying customers (free-tier to paying)

#### 2. What are the predictors/covariates? 

age, gender, location, profession, # days since last log in, activity score (0/1)

#### 3. What timeframe is this data relevant for?

Jan - April 2015

#### 4. What is the hypothesis?

High-activity / high-usage customers from affluent areas will be more likely to convert from free-tier to paying customers.

# UCLA Admissions

### Let's get started with our dataset

In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

data = pd.read_csv("admissions.csv")
data.head()
# printing just head and tail to save space
 

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [15]:
data.tail()

Unnamed: 0,admit,gre,gpa,prestige
395,0,620,4.0,2
396,0,560,3.04,3
397,0,460,2.63,2
398,0,700,3.65,2
399,0,600,3.89,3


## Data Dictionary & Description

Variable | Description | Var Type
--|--|--
Admit      |     0 = no admit, 1 = admit  |  Categorical
GRE        |     Score 200 - 800          |  Continuous, Discrete
GPA        |     0 - 4.0                  |  Continuous
Prestige   |     1 = lowest prestige, 2 = low prestige, 3 = high prestige, 4 = highest prestige      | Categorical



### Admitted

In [16]:

data = pd.read_csv("admissions.csv")
data_admit = data[data["admit"]==1]
data_noadmit = data[data["admit"]==0]
data_admit.describe()


Unnamed: 0,admit,gre,gpa,prestige
count,127,127.0,126.0,126.0
mean,1,618.897638,3.489206,2.150794
std,0,108.884884,0.371655,0.921455
min,1,300.0,2.42,1.0
25%,1,540.0,3.22,1.0
50%,1,620.0,3.545,2.0
75%,1,680.0,3.7575,3.0
max,1,800.0,4.0,4.0


### Not Admitted

In [23]:
data_noadmit.describe()

Unnamed: 0,admit,gre,gpa,prestige
count,273,271.0,272.0,273.0
mean,0,573.579336,3.345404,2.641026
std,0,116.052798,0.376773,0.917198
min,0,220.0,2.26,1.0
25%,0,500.0,3.08,2.0
50%,0,580.0,3.34,3.0
75%,0,660.0,3.61,3.0
max,0,800.0,4.0,4.0
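
The two `describe()` summaries can also be compared side by side with a single `groupby`. The sketch below is a minimal illustration, assuming the column names shown above. It rebuilds only the ten rows printed by `data.head()` and `data.tail()` so that it runs without `admissions.csv`, which means its numbers will not match the full 400-row summaries.

```python
import pandas as pd

# The ten rows printed by data.head() and data.tail() above,
# used here only so the example runs without admissions.csv.
sample = pd.DataFrame(
    {
        "admit": [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        "gre": [380, 660, 800, 640, 520, 620, 560, 460, 700, 600],
        "gpa": [3.61, 3.67, 4.0, 3.19, 2.93, 4.0, 3.04, 2.63, 3.65, 3.89],
        "prestige": [3, 3, 1, 4, 4, 2, 3, 2, 2, 3],
    }
)

# Mean of each predictor within the not-admitted (0) and admitted (1) groups.
group_means = sample.groupby("admit")[["gre", "gpa", "prestige"]].mean()
print(group_means)
```

On the real file, replacing `sample` with `data` should give the same comparison across all records.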


#### 1. What is the outcome?

Admission / No Admission to graduate school at UCLA

#### 2. What are the predictors/covariates? 

GRE, GPA, Prestige

Additionally, I would include interaction terms up to the third order. My initial reasoning was that a high-prestige school, perceived as more competitive or difficult, might mean a student with a lower GRE/GPA could still be considered a competitive candidate (e.g., a 3.0 from Harvard > a 4.0 from XYZ Community College). The distribution doesn't support that, so I would test the nature of the relationship between Prestige and the other variables.

- GRE*GPA
- GRE*Prestige
- GPA*Prestige
- GPA*Prestige*GRE
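
One way to build the four interaction terms listed above is as plain product columns. This is a rough sketch, not the only option: the new column names are hypothetical, and it treats prestige as a numeric 1-4 score even though the data dictionary lists it as categorical. The rows are the first five from `data.head()`.

```python
import pandas as pd

# First five rows from data.head() above, used only to keep the example runnable.
sample = pd.DataFrame(
    {
        "gre": [380, 660, 800, 640, 520],
        "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
        "prestige": [3, 3, 1, 4, 4],
    }
)


def add_interactions(df):
    """Return a copy of df with the four product terms added (hypothetical names)."""
    out = df.copy()
    out["gre_x_gpa"] = out["gre"] * out["gpa"]
    out["gre_x_prestige"] = out["gre"] * out["prestige"]
    out["gpa_x_prestige"] = out["gpa"] * out["prestige"]
    out["gpa_x_prestige_x_gre"] = out["gpa"] * out["prestige"] * out["gre"]
    return out


with_interactions = add_interactions(sample)
print(with_interactions)
```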

#### 3. What timeframe is this data relevant for?

Not available / made clear, but I would say it's a risk of the data to be unaware of how and when the data was collected.

#### 4. What is the hypothesis?

Based on the distribution of the data, I hypothesize that GRE score and GPA are the primary predictors of successful admission to UCLA's graduate school, and that prestige has less impact on, or possibly a negative relationship to, the likelihood that an individual will be admitted.

Using the above information, write a well-formed problem statement.


#### 5. Problem Statement
Using 400 records we want to determine how likely potential UCLA candidates are to be admitted to grad school prior to the Fall of 2016 based on a variety of factors including GRE score, GPA, and the prestige of their undergrad school. Collection time period and methods for the data are not available and will be considered risks for this analysis.

# Exploratory Analysis Plan

Using the lab from class as a guide, create an exploratory analysis plan. 

#### 1. What are the goals of the exploratory analysis? 

- Determine where, when, and how the data was collected
- Determine data type (cross sectional vs longitudinal), parse data and determine the veracity of the data
- Determine distribution and consider transformation
- Print basic stats for this data (Ex. mean, max, min, etc.)

#### 2a. What are the assumptions of the distribution of data? 

Normality

#### 2b. How will you determine the distribution of your data?

Histograms
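
As a small illustration of what a histogram reports, the sketch below bins the ten GRE scores printed by `data.head()` and `data.tail()` into 100-point bins and computes the sample skewness. The bin width is an assumption chosen for readability. On the full data, `data["gre"].hist()` would draw the corresponding plot.

```python
import numpy as np
import pandas as pd

# GRE scores from the ten rows shown by data.head() and data.tail().
gre = pd.Series([380, 660, 800, 640, 520, 620, 560, 460, 700, 600], name="gre")

# 100-point bins across the 200-800 range given in the data dictionary.
# Every bin is half-open [left, right) except the last, which includes 800.
counts, edges = np.histogram(gre, bins=np.arange(200, 900, 100))
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")

# Sample skewness: values near 0 are consistent with a symmetric distribution.
skewness = gre.skew()
print("skewness:", skewness)
```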

#### 3a. How might outliers impact your analysis? 

Outliers can skew the data, pulling the mean toward the extreme values and inflating the standard deviation.

#### 3b. How will you test for outliers? 

Testing for normality is the first step. Box plots are useful as well, since they help check for normality and detect outliers. If the normality assumption holds, visualizing the upper and lower tails of the distribution will show the outliers.
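
A box plot conventionally flags points lying more than 1.5 times the interquartile range beyond the quartiles, and that rule can be applied directly. The sketch below is one possible check, with a hypothetical helper name `iqr_outliers`. It uses the ten GRE scores shown earlier and a made-up toy series that contains an obvious extreme value.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# GRE scores from the ten rows shown by data.head() and data.tail().
gre = pd.Series([380, 660, 800, 640, 520, 620, 560, 460, 700, 600])
gre_outliers = iqr_outliers(gre)

# Made-up series for illustration only: 100 sits far above the rest.
toy = pd.Series([1, 2, 3, 4, 100])
toy_outliers = iqr_outliers(toy)

print(len(gre_outliers), toy_outliers.tolist())
```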

#### 4a. What is collinearity?

When two variables are linearly correlated to a degree that one could reliably predict the other. It can inflate the variance of the model's coefficient estimates, which is why you look at a correlation matrix and each variable's VIF to test the severity of correlation among variables.

#### 4b. How will you test for collinearity?

Correlation matrix and VIF
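
Both checks can be sketched in a few lines. The example below computes the correlation matrix with pandas and each predictor's VIF from its definition, 1 / (1 - R²), where R² comes from regressing that predictor on the others. The helper name `vif` is hypothetical and the data is only the ten rows shown earlier. statsmodels also ships a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` that could be used instead.

```python
import numpy as np
import pandas as pd

# Predictors from the ten rows shown by data.head() and data.tail().
predictors = pd.DataFrame(
    {
        "gre": [380, 660, 800, 640, 520, 620, 560, 460, 700, 600],
        "gpa": [3.61, 3.67, 4.0, 3.19, 2.93, 4.0, 3.04, 2.63, 3.65, 3.89],
        "prestige": [3, 3, 1, 4, 4, 2, 3, 2, 2, 3],
    }
)

# Pairwise Pearson correlations between the predictors.
corr = predictors.corr()


def vif(df):
    """VIF per column: 1 / (1 - R^2), regressing it on the other columns plus an intercept."""
    result = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        others = df.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(df)), others])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta
        r_squared = 1.0 - residuals.var() / y.var()
        result[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(result)


vif_values = vif(predictors)
print(corr)
print(vif_values)
```

A VIF of 1 means a predictor is uncorrelated with the others, and larger values indicate stronger collinearity.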

#### 5. What is your exploratory analysis plan?

Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis 1 year from now. 

> Before reproducing analysis: Obtain data (previously I said I'd want to find out where the data was from and when it was collected in my exploratory analysis; that was partially for reproducibility)

> Step 1 - Check for missing data and adjust the data set accordingly to improve usefulness of the data

> Step 2 - Check for multicollinearity using correlation matrix and look at each variable's VIF

> Step 3 - Conduct frequency and distribution analysis utilizing histogram visualization and determine distribution of your data
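
Step 1 can be sketched as follows. The `describe()` counts above differ between columns, which suggests some values are missing in the real file. The toy frame below takes the rows from `data.head()` and blanks out two values on purpose, so the missing entries are hypothetical. Dropping incomplete rows is only one possible adjustment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows from data.head(), with two values
# blanked out on purpose to illustrate the check.
toy = pd.DataFrame(
    {
        "admit": [0, 1, 1, 1, 0],
        "gre": [380, 660, np.nan, 640, 520],
        "gpa": [3.61, 3.67, 4.0, np.nan, 2.93],
        "prestige": [3, 3, 1, 4, 4],
    }
)

# Count missing values per column, then keep only complete rows.
missing_per_column = toy.isnull().sum()
complete_rows = toy.dropna()

print(missing_per_column)
print(len(complete_rows), "complete rows out of", len(toy))
```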

### Bonus Questions:
1. Outline your analysis method for predicting your outcome
2. Write an alternative problem statement for your dataset
3. Articulate the assumptions and risks of the alternative model

#### 1. 
Binary logistic regression with admit/not admit as the 0/1 target variable

#### 2. 
Using 400 records we want to determine the influence GPA and school prestige have on a student's GRE score. We will conduct the analysis prior to the Fall of 2016. 

#### 3. 
Normality is assumed, as are proper and reliable data collection methods, because the exact time and methods are not known. This data is for students who applied to UCLA, so we assume that the GRE scores of those who apply to grad school at UCLA do not inherently trend higher or lower.