[![Review Assignment Due Date](https://classroom.github.com/assets/deadline-readme-button-22041afd0340ce965d47ae6ef1cefeee28c7c493a6346c4f15d667ab976d596c.svg)](https://classroom.github.com/a/21RyuT3T)
[![Open Lab in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1ctGCAyrTb2-6oiEuri2F_toW87FZunKP/view?usp=sharing)

# Stat 220 Final Lab Instructions

## Project Description

**Project Description**: You have been hired as data scientists by Mashable, an online news platform that generates buzz through shares of its posts. Your task is to build a model that predicts the number of shares a news article will receive based on its characteristics.

**Data**: The data comes from Mashable.com and is hosted on the UC Irvine Machine Learning repository: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity. You can download the dataset from there or from https://richardson.byu.edu/220/OnlineNewsPopularity.csv. There are 61 variables in total. A description of the variables is available at https://richardson.byu.edu/220/ONPvariables.txt. The target variable is the number of shares a news article receives, which is the last column in the dataset.

**Deliverables**: Your work will culminate in two key deliverables:

1. A script or notebook containing all analyses and modeling steps.
2. A technical report for Mashable, written according to the instructions below.

## Project Details

### Exploratory Data Analysis (EDA)

1. Plot the target variable. Determine if the target variable seems appropriate or if any transformations are needed.
2. Build a linear regression model without higher-order terms and identify the most significant predictors.
3. Build a regression tree to identify important predictors.
4. Select several significant features from steps 2 and 3. Create visualizations or tables to explore the relationships between these features and the target variable.
5. Write an EDA section in your technical report. Report the results of the initial models and include figures or tables that show the target variable and its relationship with potentially significant predictors.
6. Use appropriate methods to remove insignificant variables from the model.
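
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the Mashable file: the column names, coefficients, and the choice of `log1p` are assumptions made for the example. It also ranks predictors by standardized coefficient size rather than by p-values, since scikit-learn's `LinearRegression` does not report significance tests.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real data (hypothetical columns and effects).
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "num_imgs": rng.poisson(4, n).astype(float),
    "num_hrefs": rng.poisson(10, n).astype(float),
    "noise": rng.normal(size=n),  # unrelated to the target by construction
})
# Right-skewed target: log-shares depend on the first two columns only.
log_shares = 6 + 0.3 * X["num_imgs"] + 0.1 * X["num_hrefs"] + rng.normal(0, 0.5, n)
shares = np.exp(log_shares).round()

# Step 1: compare skewness before and after a log transform.
skew_raw = pd.Series(shares).skew()
skew_log = pd.Series(np.log1p(shares)).skew()

# Step 2: first-order linear model on standardized predictors, so that
# coefficient magnitudes are comparable across features.
y = np.log1p(shares)
Xs = StandardScaler().fit_transform(X)
ols = LinearRegression().fit(Xs, y)
coef_rank = pd.Series(np.abs(ols.coef_), index=X.columns).sort_values(ascending=False)

# Step 3: shallow regression tree; feature_importances_ ranks the predictors.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_rank = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)

print(f"skew raw: {skew_raw:.2f}, skew log1p: {skew_log:.2f}")
print(coef_rank)
print(tree_rank)
```

On the real data, a histogram of `shares` next to a histogram of the transformed target would serve the same purpose as the skewness comparison here.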

### Linear Regression Modeling

Build and tune a linear regression model with high predictive power, and explain to Mashable which features most influence the number of shares.

1. Split the data into training and testing sets. Use the training set for model fitting and the testing set to check for overfitting and predictive performance.
2. Explore transformations of the target and other variables.
3. Explore higher-order terms.
4. Reduce the model using the following methods:
   * Stepwise model evaluation methods to remove insignificant variables.
   * LASSO regression to fit the full model and remove insignificant variables. Tune the model to find the best `α`.
5. Write a section in your technical report that reports the out-of-sample performance of the models. Discuss the most significant predictors and evaluate the model's usefulness for predicting future shares.
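
The split-then-tune workflow for the LASSO step might look like the sketch below. It runs on synthetic data with an assumed 75/25 split and 5-fold cross-validation; `LassoCV` chooses the penalty `α` from its default grid, and predictors are standardized first so the penalty treats them on a common scale. None of these settings come from the assignment itself.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first three of ten columns carry signal.
rng = np.random.default_rng(1)
n, p = 1000, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 1.0, n)

# Hold out a test set that is never used for fitting or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize, then let LassoCV pick alpha by 5-fold CV on the training set.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X_train, y_train)

lasso = model.named_steps["lassocv"]
best_alpha = lasso.alpha_
kept = np.flatnonzero(lasso.coef_)  # indices of predictors with nonzero coefficients
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"best alpha: {best_alpha:.4f}")
print(f"kept columns: {kept.tolist()}")
print(f"test RMSE: {test_rmse:.3f}")
```

If the target is modeled on a log scale, the RMSE is on that scale too, so predictions need to be back-transformed before they are compared with models fitted to raw shares.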

### Regression Tree Modeling

Build and tune a regression tree model.

1. Use the same training and testing sets as above.
2. Use cost-complexity pruning and cross-validation to find a model that fits well on out-of-sample data.
3. Fit a random forest regression model, using cost-complexity pruning for the individual trees.
4. Write a section in your technical report that reports the out-of-sample performance of the models. Discuss the model’s usefulness for predicting future shares.
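
One possible way to carry out steps 2 and 3 with scikit-learn is sketched below on synthetic data. The candidate `ccp_alpha` values come from `cost_complexity_pruning_path`, and thinning that path to every tenth value is an arbitrary choice to keep the search small. Reusing the single tree's best `ccp_alpha` for the forest is also an assumption; the forest's value could be tuned separately.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a step effect and a curved effect (hypothetical).
rng = np.random.default_rng(2)
n = 600
X = rng.uniform(-2, 2, size=(n, 4))
y = np.where(X[:, 0] > 0, 3.0, -3.0) + X[:, 1] ** 2 + rng.normal(0, 0.5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate alphas from the pruning path of a fully grown tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Clip tiny negative values caused by floating-point error, then thin the grid.
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))[::10]

# 5-fold cross-validation over ccp_alpha, using the training set only.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
best_alpha = search.best_params_["ccp_alpha"]
tree_r2 = search.best_estimator_.score(X_test, y_test)  # R^2 on the test set

# Random forest whose individual trees are pruned with the same ccp_alpha.
forest = RandomForestRegressor(
    n_estimators=100, ccp_alpha=best_alpha, random_state=0
)
forest.fit(X_train, y_train)
forest_r2 = forest.score(X_test, y_test)

print(f"best ccp_alpha: {best_alpha:.4f}")
print(f"pruned tree test R^2: {tree_r2:.3f}, forest test R^2: {forest_r2:.3f}")
```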

### Conclusion

Compare each model's predictive accuracy on the test set. Choose the best-performing model as the final predictive model. Write a concluding section in your technical report that addresses Mashable's business concerns and presents your final model along with your confidence in its predictions.

In [5]:
import pandas as pd

In [6]:
news = pd.read_csv('https://richardson.byu.edu/220/OnlineNewsPopularity.csv')
news.head()

| | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2013/01/07/amazon-instant-... | 731.0 | 12.0 | 219.0 | 0.663594 | 1.0 | 0.815385 | 4.0 | 2.0 | 1.0 | ... | 0.1 | 0.7 | -0.35 | -0.6 | -0.2 | 0.5 | -0.1875 | 0.0 | 0.1875 | 593 |
| 1 | http://mashable.com/2013/01/07/ap-samsung-spon... | 731.0 | 9.0 | 255.0 | 0.604743 | 1.0 | 0.791946 | 3.0 | 1.0 | 1.0 | ... | 0.033333 | 0.7 | -0.11875 | -0.125 | -0.1 | 0.0 | 0.0 | 0.5 | 0.0 | 711 |
| 2 | http://mashable.com/2013/01/07/apple-40-billio... | 731.0 | 9.0 | 211.0 | 0.57513 | 1.0 | 0.663866 | 3.0 | 1.0 | 1.0 | ... | 0.1 | 1.0 | -0.466667 | -0.8 | -0.133333 | 0.0 | 0.0 | 0.5 | 0.0 | 1500 |
| 3 | http://mashable.com/2013/01/07/astronaut-notre... | 731.0 | 9.0 | 531.0 | 0.503788 | 1.0 | 0.665635 | 9.0 | 0.0 | 1.0 | ... | 0.136364 | 0.8 | -0.369697 | -0.6 | -0.166667 | 0.0 | 0.0 | 0.5 | 0.0 | 1200 |
| 4 | http://mashable.com/2013/01/07/att-u-verse-apps/ | 731.0 | 13.0 | 1072.0 | 0.415646 | 1.0 | 0.54089 | 19.0 | 19.0 | 20.0 | ... | 0.033333 | 1.0 | -0.220192 | -0.5 | -0.05 | 0.454545 | 0.136364 | 0.045455 | 0.136364 | 505 |


In [7]:
news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39644 entries, 0 to 39643
Data columns (total 61 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   url                             39644 non-null  object 
 1    timedelta                      39644 non-null  float64
 2    n_tokens_title                 39644 non-null  float64
 3    n_tokens_content               39644 non-null  float64
 4    n_unique_tokens                39644 non-null  float64
 5    n_non_stop_words               39644 non-null  float64
 6    n_non_stop_unique_tokens       39644 non-null  float64
 7    num_hrefs                      39644 non-null  float64
 8    num_self_hrefs                 39644 non-null  float64
 9    num_imgs                       39644 non-null  float64
 10   num_videos                     39644 non-null  float64
 11   average_token_length           39644 non-null  float64
 12   num_keywords                   
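
The `news.info()` listing above shows every column name after `url` with a leading space (for example ` timedelta`), so `news['shares']` would raise a `KeyError` until the names are cleaned. A minimal sketch of the cleanup follows, using a tiny in-memory CSV named `news_demo` as a hypothetical stand-in for the real file. Dropping `url` and `timedelta` as non-predictive is an assumption to verify against the variable description.

```python
import io

import pandas as pd

# Tiny stand-in CSV that mimics the leading spaces in the real header.
csv_text = (
    "url, timedelta, n_tokens_title, shares\n"
    "http://example.com/a, 731.0, 12.0, 593\n"
)
news_demo = pd.read_csv(io.StringIO(csv_text))
before = list(news_demo.columns)  # names still carry the leading space

# Strip surrounding whitespace from every column name.
news_demo.columns = news_demo.columns.str.strip()

# Separate predictors from the target; url and timedelta are assumed
# to be non-predictive identifiers.
X = news_demo.drop(columns=["url", "timedelta", "shares"])
y = news_demo["shares"]

print(before)
print(list(news_demo.columns))
```

The same two lines (`str.strip()` on the columns, then `drop`) apply unchanged to the `news` frame loaded earlier.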
