# Capstone: <font color = red> Think of witty title </font> 
> **Shaun Chua** 
<br>**(DSI-13)**

---

# Table of Contents: <a id="top"></a>
[**1. Problem Statement**](#1)
<br> [**2. Importing Libraries**](#2)
<br> [**3. Data Collection**](#3)
<br> &emsp; [3.1 Exploring Reddit's API](#3.1)
<br> &emsp; [3.2 Creating a Function to Get Posts](#3.2)
<br> &emsp; [3.3 Data Collection from r/gambling](#3.3)
<br> &emsp; [3.4 Data Collection from r/problemgambling](#3.4)
<br> [**4. Data Cleaning & EDA**](#4)
<br> &emsp; [4.1 Creating a Function to Preview Data](#4.1)
<br> &emsp; [4.2 Reading and Previewing Datasets](#4.2)
<br> &emsp; [4.3 Cleaning the Datasets](#4.3)
<br> &emsp;&emsp;&emsp; [4.3.1 Selecting Columns](#4.3.1)
<br> &emsp;&emsp;&emsp; [4.3.2 Combining Datasets](#4.3.2)
<br> &emsp;&emsp;&emsp; [4.3.3 Dropping Duplicates](#4.3.3)
<br> &emsp;&emsp;&emsp; [4.3.4 Resolving Missing Values for `title`](#4.3.4)
<br> &emsp;&emsp;&emsp; [4.3.5 Resolving Missing Values for `subreddit`](#4.3.5)
<br> &emsp;&emsp;&emsp; [4.3.6 Resolving Missing Values for `selftext`](#4.3.6)
<br> &emsp;&emsp;&emsp; [4.3.7 Mapping `subreddit`](#4.3.7)
<br> &emsp;&emsp;&emsp; [4.3.8 Cleaning with RegEx](#4.3.8)
<br> &emsp;&emsp;&emsp; [4.3.9 Cleaning with Stop Words](#4.3.9)
<br> &emsp;&emsp;&emsp; [4.3.10 Cleaning with Lemmatisation](#4.3.10)
<br> &emsp; [4.4 EDA: Visualisation](#4.4)
<br> &emsp;&emsp;&emsp; [4.4.1 Word Cloud](#4.4.1)
<br> &emsp;&emsp;&emsp; [4.4.2 Barh Plot](#4.4.2)
<br> [**5. Preprocessing and Modelling**](#5)
<br> &emsp; [5.1 Train Test Split](#5.1)
<br> &emsp; [5.2 MultinomialNB](#5.2)
<br> &emsp; [5.3 Logistic Regression](#5.3)
<br> &emsp; [5.4 Model Optimisation](#5.4)
<br> &emsp;&emsp;&emsp; [5.4.1 GridSearchCV](#5.4.1)
<br> &emsp;&emsp;&emsp; [5.4.2 Optimised MultinomialNB](#5.4.2)
<br> &emsp;&emsp;&emsp; [5.4.3 Optimised Logistic Regression](#5.4.3)
<br> &emsp; [5.5 Summary of Classification Metrics](#5.5)
<br> &emsp; [5.6 Fitting the Chosen Model](#5.6)
<br> &emsp; [5.7 Feature Words and Coefficients](#5.7)
<br> &emsp;&emsp;&emsp; [5.7.1 Feature Words](#5.7.1)
<br> &emsp;&emsp;&emsp; [5.7.2 Coefficients](#5.7.2)
<br> &emsp; [5.8 The ROC Curve](#5.8)
<br> [**6. Conclusion and Recommendations**](#6)
<br> [**7. Limitations**](#7)
<br> [**8. Future Directions**](#8)

# 1. Problem Statement <a id="1"></a>



## Formulating your Problem Statement

Your problem statement should be the guiding principle for your project. You can think about this as a "SMART" goal.

## **Context:**  

The inception of the Government Electronic Business Centre (GeBIZ), which standardises government tenders and procurement, has significantly reduced <a href="https://opentextbc.ca/principlesofeconomics/chapter/16-1-the-problem-of-imperfect-information-and-asymmetric-information/">imperfect information and asymmetric information</a>.

As a result, education consultancies face the daunting challenge of balancing several tenets of business development, such as: 
* Outreach to educational institutions 
* Programme creation
* Programme pricing

## **Specific:** 
**What precisely do you plan to do?**
<br> Objective 1: Identify the business units/programmes that help or hurt sale price the most
<br> Objective 2: Create a model that may help predict the sale price for a particular programme type
<br> Objective 3: Market Basket Analysis

**What type of model will you need to develop?**
<br> 1) Linear Regression with Regularisation (Ridge, Lasso, Elastic Net)
<br> 2) Decision Tree Regressor
<br> 3) Random Forest Regressor
<br> 4) Support Vector Regressor
<br> 5) Decision Tree Regressor with AdaBoost
<br> 6) Random Forest Regressor with AdaBoost
<br> 7) Gradient Boosting Regressor 
<br> 8) Extreme Gradient Boosting (XGBoost)

## **Measurable:** 
**What metrics will you be using to assess performance?** 
* Objective 1: Coefficient of Determination ($R^2$)
* Objective 2: RMSE or MSE
* Objective 3: TBC
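
As a rough illustration of the two metrics named above, the sketch below computes RMSE and $R^2$ from scratch on made-up numbers. The function names `rmse` and `r_squared` and the toy values are assumptions for illustration only; they are not results from this project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (hypothetical sale prices and predictions, not project data)
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

toy_rmse = rmse(actual, predicted)
toy_r2 = r_squared(actual, predicted)
print(f"RMSE: {toy_rmse:.2f}, R^2: {toy_r2:.3f}")
```

RMSE is expressed in the same units as the target (dollars, for sale price), which makes it easier to interpret than MSE, while $R^2$ is unitless and describes the share of variance explained.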

## **Achievable:** 
**Is your project appropriately scoped?**
<br> Yes. 

**Is it too aggressive? Too easy?**
<br> Initially, I felt that simply running Linear Regression was too "simple", so I decided to add more dimensions to the project and run more models.

## **Relevant:**
**Does anyone care about this?**
<br> Education consultancies may find this insightful.

**Why should people be interested in your results?**
<br> Findings can assist an education consultancy in:
* Allocating resources to business development efforts based on data
* Predicting sale price for a particular programme
* Potential Market Basket Analysis (TBC)

**What value will the completion of your project be adding?**
<br> Education consultancies may make more targeted business development efforts with the aid of insights from data.

## **Time-bound:**
**What is your deadline?**
* Personal Deadline: 13 Apr 2020
* Official Deadline: 23 Apr 2020

# 2. Importing Libraries <a id="2"></a>

In [1]:
import pandas as pd

In [2]:
sales2017_df = pd.read_csv("./datasets/Vivarch Sales Projection 2017 Master.csv")

In [3]:
sales2018_df = pd.read_csv("./datasets/Vivarch Sales Projection 2018 Master.csv")

In [4]:
sales2017_df

Unnamed: 0,Project Code,INVOICE Number,Invoice Date,QTY,UNIT PRICE,TOTAL,Invoiced?\n[Y / N],Invoice \nRemarks,Reason for \nnot Invoicing,Entity,...,Sep Hols,October,November,December,Payment \nReference,Payment Date,Paid Amount,Outstanding\nAmount,Service \nConsultant,Unnamed: 36
0,VARI17001,INVVARI1702001,16-Feb-2017,4.5,$150.00,675,,4.5X$150=$675,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,5003179999,07.03.17,$675.00,$0.00,,
1,VARI17001,INVVARI1703001,6-Mar-2017,18,$150.00,2700,,18 x $150=$2700,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,5003191063,23.03.17,"$2,700.00",$0.00,,
2,VARI17001,INVVARI1705001,8-May-2017,17.5,$150.00,2625,,17.5 x 150=2625,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,5003220326,29.05.17,"$2,625.00",$0.00,,
3,VARI17001,INVVARI1706001,14-Jun-2017,4.5,$150.00,675,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,5003240231,10.07.17,$675.00,$0.00,,
4,VARI17002,BILLED UNDER PASSIONISTA,,48,$85.00,0,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,,,,$0.00,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4115,,,,,,,,,,,...,,,,,,,,,,
4116,,,,,,,,,,,...,,,,,,,,,,
4117,,,,,,,,,,,...,,,,,,,,,,
4118,,,,,,,,,,,...,,,,,,,,,,


In [5]:
sales2018_df

Unnamed: 0,PROJECT CODE,INVOICE Number,Invoice Date,QTY,UNIT PRICE,TOTAL,Invoiced?\n[Y / N],Invoice \nRemarks,Reason for \nnot Invoicing,Entity,...,Sep Hols,October,November,December,Payment \nReference,Payment Date,Paid Amount,Outstanding\nAmount,Service \nConsultant,Unnamed: 38
0,VARI18001,,,,,$0.00,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,,,,$0.00,,
1,VARI18002,,,,,$0.00,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,,,,$0.00,,
2,VARI18003,,,,,$0.00,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,,,,$0.00,,
3,VARI18004,,,,,$0.00,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,,,,$0.00,,
4,VARI18005,,,,,$0.00,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,,,,$0.00,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4591,,,,,,,,,,,...,,,,,,,,,,
4592,,,,,,,,,,,...,,,,,,,,,,
4593,,,,,,,,,,,...,,,,,,,,,,
4594,,,,,,,,,,,...,,,,,,,,,,


In [6]:
sales2017_df.columns

Index(['Project Code', 'INVOICE Number', 'Invoice Date', 'QTY', 'UNIT PRICE',
       'TOTAL', 'Invoiced?\n[Y / N]', 'Invoice \nRemarks',
       'Reason for \nnot Invoicing', 'Entity', 'Actual Entity',
       'Consultant Name', 'SCHOOL', 'Zone', 'Programme \nName', 'UOM',
       'Projected \nAmount', 'January', 'February', 'March', 'March Hols',
       'April', 'May', 'June Hols', 'July', 'August', 'September', 'Sep Hols',
       'October', 'November', 'December', 'Payment \nReference',
       'Payment Date', 'Paid Amount', 'Outstanding\nAmount',
       'Service \nConsultant', 'Unnamed: 36'],
      dtype='object')

In [7]:
sales2018_df.columns

Index(['PROJECT CODE', 'INVOICE Number', 'Invoice Date', 'QTY', 'UNIT PRICE',
       'TOTAL', 'Invoiced?\n[Y / N]', 'Invoice \nRemarks',
       'Reason for \nnot Invoicing', 'Entity', 'Actual Entity',
       'Consultant Name', 'SCHOOL', 'Zone', ' ', 'UOM', 'Remarks',
       'Deal Closed \nMonth', 'Projected \nAmount', 'January', 'February',
       'March', 'March Hols', 'April', 'May', 'June Hols', 'July', 'August',
       'September', 'Sep Hols', 'October', 'November', 'December',
       'Payment \nReference', 'Payment Date', 'Paid Amount',
       'Outstanding\nAmount', 'Service \nConsultant', 'Unnamed: 38'],
      dtype='object')
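
The two column listings above differ in small ways: `'Project Code'` versus `'PROJECT CODE'`, embedded `\n` characters, and a blank `' '` column in 2018 that appears to sit where `'Programme \nName'` is in 2017. One possible way to make the two sheets comparable is to normalise the column names first. The helper `normalise_columns` and the toy frames below are hypothetical and only sketch the idea on a handful of the column names shown above.

```python
import re

import pandas as pd


def normalise_columns(df):
    """Return a copy of df with lower-case, snake_case column names."""

    def clean(name):
        # Collapse newlines and repeated spaces, then lower-case
        name = re.sub(r"\s+", " ", str(name)).strip().lower()
        # Replace any run of non-alphanumeric characters with one underscore
        return re.sub(r"[^a-z0-9]+", "_", name).strip("_")

    return df.rename(columns=clean)


# Toy frames carrying only a few of the column names from the real sheets
toy_2017 = pd.DataFrame(columns=["Project Code", "Programme \nName", "Outstanding\nAmount"])
toy_2018 = pd.DataFrame(columns=["PROJECT CODE", " ", "Outstanding\nAmount"])

cols_2017 = set(normalise_columns(toy_2017).columns)
cols_2018 = set(normalise_columns(toy_2018).columns)

# Columns present in one year but not the other after normalisation
only_2017 = sorted(cols_2017 - cols_2018)
only_2018 = sorted(cols_2018 - cols_2017)
print(only_2017, only_2018)
```

A blank header normalises to an empty string, so it still shows up in the comparison and would need to be renamed by hand once its contents have been checked.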

In [8]:
sales2017_df.shape

(4120, 37)

In [9]:
sales2018_df.shape

(4596, 39)

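A suggestion from the Databricks forum thread linked below: `df.na.drop(how = "all")` removes a row only if all of its values are null. Note that `df.na.drop` belongs to the Spark DataFrame API; the pandas counterpart is `DataFrame.dropna`.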

https://forums.databricks.com/questions/15146/how-to-remove-empty-rows-from-the-data-frame.html
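
To make the difference between the `how` and `thresh` options of pandas `dropna` concrete, here is a minimal sketch on a toy frame. The frame `toy_df` and its values are made up for illustration and only loosely mimic the sales sheets.

```python
import numpy as np
import pandas as pd

# Toy frame: row 2 is completely empty, row 3 has a single stray value
toy_df = pd.DataFrame(
    {
        "Project Code": ["VARI17001", "VARI17002", np.nan, np.nan],
        "QTY": [4.5, 48, np.nan, np.nan],
        "TOTAL": [675, np.nan, np.nan, 10],
    }
)

# how="all" drops a row only when every value in it is missing
no_empty_rows = toy_df.dropna(how="all")

# thresh=2 keeps a row only when it has at least 2 non-missing values
at_least_two = toy_df.dropna(thresh=2)

print(list(no_empty_rows.index))
print(list(at_least_two.index))
```

`how="all"` is the more conservative choice, while `thresh` also removes rows that carry only a few stray values. Either way, `dropna` returns a new DataFrame, so the result has to be assigned to a variable to be kept.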

In [18]:
sales2017_df.dropna(axis=0, thresh=10)

Unnamed: 0,Project Code,INVOICE Number,Invoice Date,QTY,UNIT PRICE,TOTAL,Invoiced?\n[Y / N],Invoice \nRemarks,Reason for \nnot Invoicing,Entity,...,Sep Hols,October,November,December,Payment \nReference,Payment Date,Paid Amount,Outstanding\nAmount,Service \nConsultant,Unnamed: 36
0,VARI17001,INVVARI1702001,16-Feb-2017,4.5,$150.00,675,,4.5X$150=$675,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,5003179999,07.03.17,$675.00,$0.00,,
1,VARI17001,INVVARI1703001,6-Mar-2017,18,$150.00,2700,,18 x $150=$2700,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,5003191063,23.03.17,"$2,700.00",$0.00,,
2,VARI17001,INVVARI1705001,8-May-2017,17.5,$150.00,2625,,17.5 x 150=2625,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,5003220326,29.05.17,"$2,625.00",$0.00,,
3,VARI17001,INVVARI1706001,14-Jun-2017,4.5,$150.00,675,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,5003240231,10.07.17,$675.00,$0.00,,
4,VARI17002,BILLED UNDER PASSIONISTA,,48,$85.00,0,,,,ARTELIER (INSTRUCTOR) PTE LTD,...,,,,,,,,$0.00,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2188,VIVE17077,INVVIVE1711022,26TH NOV,217,$40.00,"$8,680.00",,,,VIVARCH ENRICHMENT PTE LTD,...,,,"$8,680.00",,5003320855,19.12.17,"$8,680.00",$0.00,,
2189,VIVE17078,INVVIVE1711016,24TH NOV,1,"$17,130.00","$17,130.00",,,,VIVARCH ENRICHMENT PTE LTD,...,,"$17,130.00",,,5003323984,26.12.17,"$17,130.00",$0.00,,
2190,VIVE17079,INVVIVE1711001,16th Nov,1,"$3,600.00","$3,600.00",,,,VIVARCH ENRICHMENT PTE LTD,...,,"$3,192.00",,,5003314588,11.12.17,"$3,600.00",$0.00,,
2191,VIVE17080,INVVIVE1711001,,1,$650.00,$650.00,,,,VIVARCH ENRICHMENT PTE LTD,...,,,$650.00,,5003309500,30.11.17,$650.00,$0.00,,
