### Project Hypothesis Testing: “Free Trial” Screener

*Problem description & Context:* 

Udacity courses have two options on the course overview page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

For more background on this problem, see: https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True


#### Data description:
Columns:
- Pageviews: Number of unique cookies to view the course overview page that day. 
- Clicks: Number of unique cookies to click the course overview page that day. 
- Enrollments: Number of user-ids to enroll in the free trial that day. 
- Payments: Number of user-ids who enrolled on that day and remained enrolled for 14 days, thus making a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)


---
### Objective 1:  A/B testing 
Questions you are asked in the quiz page:
- What metrics are used for A/B testing? Metrics related to Key Performance Indicators are good. How do we choose KPIs?  
- How many days of observation are there in the control and experimental group? (37 days both)
- How many missing values are there in the control and experiment data?

- Which statistical distributions are appropriate for the current problem to compare the significance of the difference between the control and experiment groups? 

- In frequentist analysis, the approach most often used for A/B testing, we use p-values to measure the significance of the experimental feature against the null hypothesis (the hypothesis that the new feature has no impact). How do you compute p-values? What do p-values tell us? Are you familiar with type-I and type-II errors? Which of these error types are p-values related to?
- Bonus points: are you aware of Bayesian A/B testing? If so, can you briefly describe your understanding?
- Is the number of data points in the experiment enough to make a reasonable judgement, or should Udacity run a longer experiment? Remember that running the experiment longer may be costly for many reasons, so you should always look for the optimal number of samples needed to make a statistically sound decision.
- What does your A/B testing analysis tell you? Does the experimental feature improve Enrollment, the target variable? What is your recommendation to Udacity’s web developers?
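As a starting point for the p-value and sample-size questions above, here is a minimal sketch using only the Python standard library. It assumes the metric being compared is a proportion (for example, enrollments per click, which is an assumption and not something the brief prescribes) and uses a pooled two-proportion z-test; other tests may suit the data better. The function names and all counts in the example are made up for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportion_ztest(x_control, n_control, x_experiment, n_experiment):
    """Two-sided z-test for a difference between two proportions.

    Uses the pooled standard error, which is valid under the null
    hypothesis that both groups share the same underlying rate.
    """
    p_control = x_control / n_control
    p_experiment = x_experiment / n_experiment
    p_pooled = (x_control + x_experiment) / (n_control + n_experiment)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_experiment))
    z = (p_experiment - p_control) / se
    # Probability of a statistic at least this extreme if the null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def sample_size_per_group(p_baseline, min_effect, alpha=0.05, power=0.8):
    """Approximate units needed per group to detect an absolute change.

    alpha is the tolerated type-I error rate (false positive);
    1 - power is the tolerated type-II error rate (false negative).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_new = p_baseline + min_effect
    variance = p_baseline * (1 - p_baseline) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


# Hypothetical totals, NOT taken from the Udacity data.
z, p_value = two_proportion_ztest(200, 1000, 150, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("clicks needed per group:", sample_size_per_group(0.20, -0.05))
```

Comparing the p-value with the chosen `alpha` controls the type-I error rate, while the `power` argument in the sample-size function is where the type-II error rate enters the planning.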

Tasks you need to perform here to demonstrate your understanding:
 * Load and explore the control and experiment data tables 
 * Perform the A/B testing analysis, paying attention to the following details
 * Plan your analysis steps and write down your plan in a Jupyter Markdown cell
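
The loading and exploration step could look roughly like the sketch below, assuming pandas is available. The inline CSV text is made up and only mimics the column layout described in the data description; in the notebook you would point `pd.read_csv` at the real control and experiment files instead. The `summarize` helper is a hypothetical name introduced here.

```python
import io

import pandas as pd

# Made-up rows that only mimic the documented columns; replace the
# StringIO objects with the paths of the real CSV files.
CONTROL_CSV = '''Date,Pageviews,Clicks,Enrollments,Payments
"Sat, Oct 11",7000,600,130,70
"Sun, Oct 12",9000,800,150,60
"Mon, Oct 13",10000,850,,
'''
EXPERIMENT_CSV = '''Date,Pageviews,Clicks,Enrollments,Payments
"Sat, Oct 11",7100,610,100,30
"Sun, Oct 12",9200,790,140,90
"Mon, Oct 13",10100,830,,
'''


def summarize(df, name):
    """Count observation days and missing values for one group."""
    return {
        "group": name,
        "days": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
    }


control_tbl = pd.read_csv(io.StringIO(CONTROL_CSV))
experiment_tbl = pd.read_csv(io.StringIO(EXPERIMENT_CSV))

control_summary = summarize(control_tbl, "control")
experiment_summary = summarize(experiment_tbl, "experiment")
print(control_summary)
print(experiment_summary)
```

Empty fields are read as `NaN`, so the per-column counts show directly which columns (here the enrollment and payment columns of the toy rows) are tracked for fewer days.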


---
### Objective 2: Machine Learning
- Combine the control_tbl and experiment_tbl, adding an “id” column indicating if the data was part of the experiment or not
- Add a “row_id” column to help track which rows are selected for training and testing in the modeling section
- Create a “Day of Week” feature from the “Date” column
- Drop the unnecessary “Date” column and the “Payments” column
- Handle the missing data (NA) by removing these rows.
- Shuffle the rows to mix the data up for learning
- Think of which data features are relevant to predicting the target variable. 
- Using the “Enrollments” column as the target variable, train machine learning models with 5-fold cross-validation using the following 3 algorithms:
    - Linear Regression
    - Decision Trees
    - XGBoost
- Calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) of each model using the test data. See <a href=https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d>here</a> for reference on these metrics.
- Compute feature importance: what’s driving the model? Which parameters are important predictors for the different ML models?
- Develop a story for what contributes to the goal of gaining Enrollments
- Discuss your results and draw some conclusions. For example, how is the Experiment=0 or 1 variable related to the Enrollment prediction? Hint: think of positive and negative correlations.
- What information do you gain using the Machine Learning approach that you couldn’t obtain using A/B testing?
- Get a Learning Recommendation for those that want to learn how to implement machine learning following best practices for any business problem.
- What Should Udacity Do?
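
The data preparation steps listed above can be sketched as follows, assuming pandas and assuming the "Date" values start with a weekday abbreviation followed by a comma (for example "Sat, Oct 11"). The `prepare_data` helper and the toy tables are hypothetical and only illustrate the order of operations.

```python
import pandas as pd


def prepare_data(control_tbl, experiment_tbl, seed=42):
    """Combine both groups and apply the preparation steps in order."""
    control = control_tbl.copy()
    experiment = experiment_tbl.copy()
    control["id"] = 0      # 0 = control group
    experiment["id"] = 1   # 1 = experiment group
    data = pd.concat([control, experiment], ignore_index=True)
    data["row_id"] = range(len(data))  # stable key for tracking rows
    # Assumes dates look like "Sat, Oct 11": keep the part before the comma.
    data["Day of Week"] = data["Date"].str.split(",").str[0].str.strip()
    data = data.drop(columns=["Date", "Payments"])
    data = data.dropna()   # remove rows with missing values
    # Shuffle the rows reproducibly, then rebuild a clean index.
    return data.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy tables with made-up numbers, only to show the shapes involved.
toy_control = pd.DataFrame({
    "Date": ["Sat, Oct 11", "Sun, Oct 12", "Mon, Oct 13"],
    "Pageviews": [7000, 9000, 10000],
    "Clicks": [600, 800, 850],
    "Enrollments": [130, 150, None],
    "Payments": [70, 60, None],
})
toy_experiment = pd.DataFrame({
    "Date": ["Sat, Oct 11", "Sun, Oct 12", "Mon, Oct 13"],
    "Pageviews": [7100, 9200, 10100],
    "Clicks": [610, 790, 830],
    "Enrollments": [100, 140, None],
    "Payments": [30, 90, None],
})

prepared = prepare_data(toy_control, toy_experiment)
print(prepared)
```

Because "Day of Week" is categorical, it would still need an encoding step (one-hot encoding, for instance) before being passed to a linear model.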

- What is the difference between using A/B testing to test a hypothesis (in this case, showing a message window) and using machine learning to learn the viability of the same effect?
- Why are most A/B testing examples given for customer satisfaction in website design or related problems?
- Understand why Machine Learning is a better approach for performing A/B Testing versus traditional statistical inference (e.g. z-score, t-test)
- What is the purpose of training using k-fold cross-validation instead of using the whole data set to train the ML models?
- Is the "Experiment" column relevant to predicting Enrollment? What does this tell you? Compare this with the A/B testing you did earlier.
- How do you improve the modeling?
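
To make the k-fold question concrete, here is a minimal sketch of 5-fold cross-validation for a plain least-squares linear regression, written with NumPy only so that the mechanics stay visible. The data is synthetic (the coefficients and noise level are invented), and the helper names are hypothetical. In the notebook, scikit-learn's `KFold` and `cross_validate` would normally do this work, and the same loop applies to decision trees or XGBoost.

```python
import numpy as np


def kfold_indices(n_rows, k=5, seed=0):
    """Shuffle row positions once and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_rows), k)


def cross_validate_linear(X, y, k=5, seed=0):
    """Return MAE and RMSE averaged over k held-out folds."""
    folds = kfold_indices(len(y), k, seed)
    maes, rmses = [], []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, which is held out for testing.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        X_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)
        errors = X_test @ coef - y[test_idx]
        maes.append(np.mean(np.abs(errors)))
        rmses.append(np.sqrt(np.mean(errors ** 2)))
    return float(np.mean(maes)), float(np.mean(rmses))


# Synthetic stand-in data: invented relationship, NOT the Udacity tables.
rng = np.random.default_rng(1)
clicks = rng.uniform(500, 900, size=40)
experiment = np.repeat([0, 1], 20)
enrollments = 0.2 * clicks - 20 * experiment + rng.normal(0, 5, size=40)

X = np.column_stack([clicks, experiment])
mae, rmse = cross_validate_linear(X, enrollments)
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}")
```

Every row is used for testing exactly once and never by the model that was trained on it, which is why the averaged errors are a fairer estimate of out-of-sample performance than errors measured on the training data itself.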
