## Pipeline Workflow

This notebook covers the workflow for both pipelines:
1. Data Pipeline
2. Model Pipeline


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd "/content/drive/Shareddrives/SIADS - 694-695 Team Drive/python-files"
!ls -l

/content/drive/Shareddrives/SIADS - 694-695 Team Drive/python-files
total 184
-rw------- 1 root root 11560 May 30 17:02 a1_likelihood_model.py
-rw------- 1 root root  1123 May 30 16:37 a1_model_inference.py
-rw------- 1 root root  1086 May 30 16:04 a1_model_training.py
-rw------- 1 root root  1073 May 30 16:04 a2_model_inference.py
-rw------- 1 root root  1076 May 30 16:03 a2_model_training.py
-rw------- 1 root root 47838 May 25 19:39 B2_B1_clustering_code.py
-rw------- 1 root root 47276 May 25 20:27 B2_B1_model_training.py
-rw------- 1 root root  1577 May 20 00:04 clean_dataset.py
-rw------- 1 root root  2434 May 12 00:19 create_sample_dataset.py
-rw------- 1 root root  2988 May 15 00:57 data_extraction.py
-rw------- 1 root root 20393 May 26 15:25 data_prep.py
-rw------- 1 root root     0 Apr 23 23:30 data_preprocessing.py
-rw------- 1 root root  1005 May 20 00:44 data_vis.py
-rw------- 1 root root 24216 May 30 16:59 orchestrator.ipynb
drwx------ 2 root root  4096 May 11 04:15 __pycac

In [None]:
!python --version

Python 3.7.13


In [None]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
ERROR: Could not find a version that satisfies the requirement pickle (from versions: none)
ERROR: No matching distribution found for pickle


If you need libraries that are not built into Colab's environment, you can install them as follows:

In [None]:
## installing relevant packages
# !pip install google-search-results package2 package3 ...
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip --quiet
!pip install markupsafe~=2.1.1 --quiet
!pip install folium==0.2.1 --quiet
!pip install markupsafe==2.0.1 --quiet
!pip install imbalanced-learn --quiet
!pip install scikit-learn-extra --quiet
!pip install factor_analyzer --quiet
!pip install --upgrade category_encoders --quiet
!pip install --upgrade s-dbw --quiet
!pip install prince --quiet
!pip install selenium --quiet
!pip install pickle5 --quiet
!pip install pyspark --quiet
!pip install plotly --quiet
!pip install pyyaml==5.4.1
!pip install chart-studio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chart-studio
  Downloading chart_studio-1.1.0-py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 2.4 MB/s
Collecting retrying>=1.3.3
  Downloading retrying-1.3.3.tar.gz (10 kB)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... done
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11447 sha256=03f92876175fc10db347c6ec7cc3c4fd3591da45c9ebd9d059e3bbdd6eaf8b41
  Stored in directory: /root/.cache/pip/wheels/f9/8d/8d/f6af3f7f9eea3553bc2fe6d53e4b287dad18b06a861ac56ddf
Successfully built retrying
Installing collected packages: retrying, chart-studio
Successfully installed chart-studio-1.1.0 retrying-1.3.3
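
The `requirements.txt` install earlier in this notebook fails because `pickle` ships with Python's standard library and has no distribution on PyPI. One way to catch this before calling `pip` is to filter standard-library names out of the requirement lines. The sketch below is an illustration only: `drop_stdlib_requirements` is a hypothetical helper, and `sys.stdlib_module_names` exists only on Python 3.10 and later, so on the Python 3.7 runtime shown above it falls back to a small hand-written set.

```python
import re
import sys

# Hypothetical fallback for interpreters older than 3.10, which lack
# sys.stdlib_module_names; extend it as needed.
_FALLBACK_STDLIB = frozenset({"pickle", "os", "sys", "json", "re"})


def drop_stdlib_requirements(lines, stdlib_names=None):
    """Return the requirement lines whose package name is not a stdlib module."""
    if stdlib_names is None:
        stdlib_names = getattr(sys, "stdlib_module_names", _FALLBACK_STDLIB)
    kept = []
    for line in lines:
        # The package name is everything before a version specifier, extra, or marker.
        name = re.split(r"[<>=~!\[; ]", line.strip(), maxsplit=1)[0]
        if name in stdlib_names:
            continue  # e.g. "pickle": already part of Python, nothing to install
        kept.append(line)
    return kept


# Example lines; only `pickle` and `pyyaml==5.4.1` come from this notebook.
requirements = ["plotly", "pickle", "pyyaml==5.4.1"]
print(drop_stdlib_requirements(requirements))
```

The filtered list could then be written back to a requirements file before running `pip install -r`.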


### Step 1: Data Extraction Portion

Goal for this step:
- Pull data from BigQuery and generate a frozen 6-month dataset
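
The six-month window can be enumerated before any query is issued. The sketch below lists the month boundaries. The August 2016 start is taken from the file names printed in Step 3; `month_windows` is a hypothetical helper and not necessarily how `data_extraction.py` partitions its queries.

```python
from datetime import date, timedelta


def month_windows(start_year, start_month, n_months):
    """Return (label, first_day, last_day) for n_months consecutive months."""
    windows = []
    year, month = start_year, start_month
    for _ in range(n_months):
        first_day = date(year, month, 1)
        # Step to the first day of the following month, rolling over the year.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        last_day = date(next_year, next_month, 1) - timedelta(days=1)
        windows.append((first_day.strftime("%B %Y"), first_day, last_day))
        year, month = next_year, next_month
    return windows


for label, first_day, last_day in month_windows(2016, 8, 6):
    print(label, first_day.isoformat(), last_day.isoformat())
```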


In [None]:
!python data_extraction.py \
  --output_directory ../datasets/monthly_partitioned_data/ \
  --credentials_json ../credentials/compact-scene-317315-56e479e9e148.json

### Step 2: Transaction Extraction Portion

**WARNING: THIS TAKES ~12 HOURS TO RUN**

Goal:
- Extract all transactions from the monthly datasets produced in Step 1
- Extract all non-transactions from the monthly datasets produced in Step 1
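
A minimal sketch of the split, assuming a session counts as a transaction when its `totals.transactions` field is present and greater than zero. That rule and the `split_sessions` helper are assumptions made for illustration; `transaction_extraction.py` may use different logic.

```python
def split_sessions(rows, column="totals.transactions"):
    """Partition session rows (dicts) into transactions and non-transactions."""
    transactions, non_transactions = [], []
    for row in rows:
        raw = row.get(column)
        # Missing or empty values are treated as "no transaction" (assumption).
        count = float(raw) if raw not in (None, "") else 0.0
        if count > 0:
            transactions.append(row)
        else:
            non_transactions.append(row)
    return transactions, non_transactions


sessions = [
    {"fullVisitorId": "1", "totals.transactions": "1"},
    {"fullVisitorId": "2", "totals.transactions": ""},
    {"fullVisitorId": "3"},
]
bought, browsed = split_sessions(sessions)
print(len(bought), len(browsed))
```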


In [None]:
%%time
!python transaction_extraction.py \
  --input_directory ../datasets/monthly_partitioned_data/ \
  --output_directory ../datasets/monthly_partitioned_data_transactions/

### Step 3: Data Sampling Portion

Goal:
1. Read all transaction CSVs from the input directory and combine them into one DataFrame.
2. Read all non-transaction CSVs from the input directory and sample 10% of them.
3. Combine the results of items 1 and 2 into one DataFrame and save it as `sample_dataset.csv`.
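
The sampling logic can be sketched in plain Python as follows. This is an illustration under assumptions: `build_sample` is a hypothetical helper that works on lists of rows, the seed is arbitrary, and `create_sample_dataset.py` itself presumably operates on DataFrames.

```python
import random


def build_sample(transaction_rows, non_transaction_rows, fraction=0.10, seed=0):
    """Keep every transaction row plus a random fraction of non-transaction rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample_size = round(len(non_transaction_rows) * fraction)
    sampled = rng.sample(list(non_transaction_rows), sample_size)
    return list(transaction_rows) + sampled


transactions = [{"fullVisitorId": str(i)} for i in range(3)]
non_transactions = [{"fullVisitorId": str(100 + i)} for i in range(50)]
sample = build_sample(transactions, non_transactions)
print(len(sample))  # 3 transactions + 5 sampled non-transactions
```

Keeping every transaction while downsampling the non-transactions reduces the dataset size without discarding the rarer class.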


In [None]:
%%time
!python create_sample_dataset.py \
  --input_directory ../datasets/monthly_partitioned_data_transactions/ \
  --output_file ../datasets/sample_dataset.csv

../datasets/monthly_partitioned_data_transactions/ ../datasets/sample_dataset.csv
['../datasets/monthly_partitioned_data_transactions/non_transactions/non_transactions_from_January 2017 Google Analytics Dataset.csv', '../datasets/monthly_partitioned_data_transactions/non_transactions/non_transactions_from_August 2016 Google Analytics Dataset.csv', '../datasets/monthly_partitioned_data_transactions/non_transactions/non_transactions_from_September 2016 Google Analytics Dataset.csv', '../datasets/monthly_partitioned_data_transactions/non_transactions/non_transactions_from_October 2016 Google Analytics Dataset.csv', '../datasets/monthly_partitioned_data_transactions/non_transactions/non_transactions_from_November 2016 Google Analytics Dataset.csv', '../datasets/monthly_partitioned_data_transactions/non_transactions/non_transactions_from_December 2016 Google Analytics Dataset.csv']
100% 6/6 [03:32<00:00, 35.37s/it]
CPU times: user 2.91 s, sys: 595 ms, total: 3.5 s
Wall time: 6min 2s


### Step 4: Common Data Cleaning & Descriptive Analysis Portion

Goal:
- Drop unneeded columns (Refer to this [link](https://docs.google.com/spreadsheets/d/1fT-iZyGZnpkli9ve9EY9TRTfycNlIoxp/edit?usp=sharing&ouid=116636835356831800242&rtpof=true&sd=true))
- Cast to appropriate datatypes

In a separate notebook:
- Visualizations
- EDA
- Describing Data
- Histograms
- Dashboarding ([Pandas Profiling](https://github.com/ydataai/pandas-profiling))
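
The two cleaning goals (dropping columns and casting datatypes) can be sketched as below. The helper `clean_rows` and the column `unused_column` are hypothetical; `totals.hits` and `totals.pageviews` appear in the Step 5 log, but the actual drop list lives in the linked spreadsheet and is not reproduced here.

```python
def clean_rows(rows, drop_columns, casts):
    """Drop unneeded columns and cast the remaining ones to the given types."""
    cleaned = []
    for row in rows:
        kept = {key: value for key, value in row.items() if key not in drop_columns}
        for column, cast in casts.items():
            # Leave missing or empty values untouched so they can be filled later.
            if column in kept and kept[column] not in (None, ""):
                kept[column] = cast(kept[column])
        cleaned.append(kept)
    return cleaned


raw = [{"totals.hits": "4", "totals.pageviews": "", "unused_column": "x"}]
print(clean_rows(raw, {"unused_column"}, {"totals.hits": int, "totals.pageviews": int}))
```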


In [None]:
!python clean_dataset.py \
  --input_file ../datasets/sample_dataset.csv \
  --output_file ../datasets/cleaned_dataset.csv \
  --dashboard_file ../datasets/dashboard_files/cleaned_dataset_dashboard.html

tcmalloc: large alloc 1073741824 bytes == 0x4654c000 @  0x7f76673da2a4 0x7f765582e9a5 0x7f765582fcc1 0x7f765583169e 0x7f765580250c 0x7f765580f399 0x7f76557f797a 0x59afff 0x515655 0x549e0e 0x593fce 0x511e2c 0x549576 0x593fce 0x511e2c 0x593dd7 0x5118f8 0x549576 0x4bcb19 0x5134a6 0x549e0e 0x593fce 0x548ae9 0x51566f 0x549576 0x604173 0x5f5506 0x5f8c6c 0x5f9206 0x64faf2 0x64fc4e
tcmalloc: large alloc 2147483648 bytes == 0x96d5e000 @  0x7f76673da2a4 0x7f765582e9a5 0x7f765582fcc1 0x7f765583169e 0x7f765580250c 0x7f765580f399 0x7f76557f797a 0x59afff 0x515655 0x549e0e 0x593fce 0x511e2c 0x549576 0x593fce 0x511e2c 0x593dd7 0x5118f8 0x549576 0x4bcb19 0x5134a6 0x549e0e 0x593fce 0x548ae9 0x51566f 0x549576 0x604173 0x5f5506 0x5f8c6c 0x5f9206 0x64faf2 0x64fc4e
Summarize dataset: 100% 101/101 [00:30<00:00,  3.31it/s, Completed]
Generate report structure: 100% 1/1 [00:06<00:00,  6.35s/it]
Render HTML: 100% 1/1 [00:04<00:00,  4.24s/it]
Export report to file: 100% 1/1 [00:00<00:00,  1.28it/s]


### Step 5: Data Prep for Modeling

Goal:
- Create 4 datasets as inputs for 4 models
- Prep should include:
  * Column Dropping
  * Column Encoding

The datasets, one per model, should be:
- A1: Likelihood-to-convert dataset (visitors)
- B2: Complex clustering dataset (visitors)
- B1: RFM dataset
- A2: Returning customers dataset
- [Optional] A3: Attribution model dataset
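
The log printed by `data_prep.py` reports one-hot and weight-of-evidence (`woe`) encodings. As a rough illustration of the latter, the sketch below computes one common form of WoE per category: the log of the category's share of positives over its share of negatives. The additive `smoothing` term and the `woe_table` helper are assumptions; the script may rely on a library encoder with different regularization or sign convention.

```python
import math
from collections import Counter


def woe_table(categories, targets, smoothing=0.5):
    """Map each category to ln(share of positives / share of negatives)."""
    positives = Counter(c for c, t in zip(categories, targets) if t == 1)
    negatives = Counter(c for c, t in zip(categories, targets) if t == 0)
    total_pos = sum(positives.values())
    total_neg = sum(negatives.values())
    table = {}
    for category in set(categories):
        # Smoothing keeps the ratio finite when a category has no positives
        # or no negatives.
        pos_share = (positives[category] + smoothing) / (total_pos + 2 * smoothing)
        neg_share = (negatives[category] + smoothing) / (total_neg + 2 * smoothing)
        table[category] = math.log(pos_share / neg_share)
    return table


countries = ["US", "US", "US", "CA", "CA", "CA"]
converted = [1, 1, 0, 0, 0, 1]
print(woe_table(countries, converted))
```

Categories that convert more often than average receive a positive value under this convention, which lets a high-cardinality column such as `geoNetwork.country` be replaced by a single numeric column.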


In [None]:
!python data_prep.py \
  --input_file ../datasets/cleaned_dataset.csv \
  --output_directory ../datasets/model_files

  import pandas.util.testing as tm
../datasets/cleaned_dataset.csv ../datasets/model_files
2016-08-01 00:00:00 succesfully transformed to a datime object
Divided by 10^6
Transformations being made
1 date NO CHANGE
2 fullVisitorId NO CHANGE
3 socialEngagementType fillnan
3 socialEngagementType binary
3 socialEngagementType int64
4 channelGrouping one_hot drop
5 totals.hits int64
5 totals.hits NO CHANGE
6 totals.pageviews fillnan
7 totals.timeOnSite fillnan
8 totals.transactions fillnan
9 totals.newVisits fillnan
9 totals.newVisits int64
10 hits.eCommerceAction.action_type int64
10 hits.eCommerceAction.action_type NO CHANGE
11 totals.bounces fillnan
11 totals.bounces int64
0.1183562197092084
12 geoNetwork.country geoNetwork.country_woe woe
12 geoNetwork.country drop
0.1183562197092084
13 trafficSource.source trafficSource.source_woe woe
13 trafficSource.source drop
14 trafficSource.medium one_hot drop
15 trafficSource.isTrueDirect fillnan
cat_code trafficSource.isTrueDirect
0.11835621970

### Step 6: Data Visualization

Goal:
- Visualize the model datasets.

Written to the `dashboard_files` directory:
- Visualizations
- EDA
- Describing Data
- Histograms
- Dashboarding ([Pandas Profiling](https://github.com/ydataai/pandas-profiling))

In [None]:
!python data_vis.py \
  --input_directory ../datasets/model_files/ \
  --output_directory ../datasets/dashboard_files/

Summarize dataset: 100% 34/34 [00:06<00:00,  5.61it/s, Completed]
Generate report structure: 100% 1/1 [00:03<00:00,  3.10s/it]
Render HTML: 100% 1/1 [00:00<00:00,  1.25it/s]
Export report to file: 100% 1/1 [00:00<00:00, 72.90it/s]
Summarize dataset: 100% 29/29 [00:05<00:00,  4.96it/s, Completed]
Generate report structure: 100% 1/1 [00:04<00:00,  4.64s/it]
Render HTML: 100% 1/1 [00:00<00:00,  1.54it/s]
Export report to file: 100% 1/1 [00:00<00:00, 74.63it/s]
Summarize dataset: 100% 203/203 [01:04<00:00,  3.17it/s, Completed]
Generate report structure: 100% 1/1 [00:19<00:00, 19.47s/it]
Render HTML: 100% 1/1 [00:05<00:00,  5.91s/it]
Export report to file: 100% 1/1 [00:00<00:00, 22.47it/s]
Summarize dataset: 100% 203/203 [00:47<00:00,  4.24it/s, Completed]
Generate report structure: 100% 1/1 [00:19<00:00, 19.36s/it]
Render HTML: 100% 1/1 [00:06<00:00,  6.04s/it]
Export report to file: 100% 1/1 [00:00<00:00, 22.96it/s]


# Modeling Pipeline

For each model, we did the following:
1. Train the model and save it
2. Run model inference and save the results (visualizations, CSV files, etc.)
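
The train-then-infer pattern can be sketched as follows. The "model" here is a stand-in majority-class baseline stored as a plain dict, and `train_and_save`, `load_and_predict`, and the file name are hypothetical; the real scripts train the models described in the sections below.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def train_and_save(labels, model_path):
    """Fit a stand-in majority-class 'model' and pickle it to model_path."""
    model = {"majority_class": Counter(labels).most_common(1)[0][0]}
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    return model


def load_and_predict(model_path, n_rows):
    """Reload the pickled model and predict one label per row."""
    with open(model_path, "rb") as handle:
        model = pickle.load(handle)
    return [model["majority_class"]] * n_rows


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "baseline.pkl"  # hypothetical file name
    train_and_save([0, 0, 1, 0], path)
    print(load_and_predict(path, 3))
```

Separating the two functions mirrors the split between the `*_model_training.py` and `*_model_inference.py` scripts: inference only needs the saved artifact, not the training data.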

## Supervised Learning (A1): Conversion Likelihood Analysis

**WARNING: THIS CAN TAKE ~1 HOUR 30 MINUTES TO RUN**

Goal: Identify whether a user is a potential customer.

This step of the pipeline (A1) has not been fully tested here because it trains the model on PySpark for efficiency reasons. However, the main requirements of Supervised Learning A2 and Unsupervised Learning B1 and B2 have been fully tested in this pipeline.

In [None]:
!python a1_model_training.py \
  --input_dataset ../datasets/model_files/A1_B2_data.csv \
  --output_directory ../models/spark_models/ \
  --output_result_directory ../results/ \
  --output_visualization_directory ../visualizations/ \
  --save True

  defaults = yaml.load(f)
../datasets/model_files/A1_B2_data.csv ../models/spark_models/ ../visualizations/
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/31 02:08:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/05/31 02:09:02 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
22/05/31 02:09:03 WARN TaskSetManager: Stage 0 contains a task of very large size (4111 KiB). The maximum recommended task size is 1000 KiB.
22/05/31 02:09:49 WARN TaskSetManager: Stage 3 contains a task of very large size (4111 KiB). The maximum recommended task size is 1000 KiB.
22/05/31 02:09:54 WARN TaskSetManager: Stage 4 contains a task of very large 

## Supervised Learning (A2): Repurchaser Analysis

Goal: Identify whether a user is a potential repurchaser, i.e. a potential returning customer.
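
The feature importances printed by this step include `Recency`, `Frequency`, and `Monetary`. A minimal sketch of how such RFM features are commonly derived from a purchase history is shown below. The definitions used (days since last purchase, purchase count, total revenue) and the `rfm_features` helper are assumptions, not necessarily what `data_prep.py` computes.

```python
from collections import defaultdict
from datetime import date


def rfm_features(transactions, as_of):
    """Compute recency (days), frequency (count) and monetary (sum) per visitor.

    transactions: iterable of (visitor_id, purchase_date, revenue) tuples.
    as_of: reference date from which recency is measured.
    """
    history = defaultdict(list)
    for visitor_id, purchase_date, revenue in transactions:
        history[visitor_id].append((purchase_date, revenue))
    features = {}
    for visitor_id, purchases in history.items():
        last_purchase = max(day for day, _ in purchases)
        features[visitor_id] = {
            "Recency": (as_of - last_purchase).days,
            "Frequency": len(purchases),
            "Monetary": sum(revenue for _, revenue in purchases),
        }
    return features


purchases = [
    ("a", date(2016, 8, 1), 10.0),
    ("a", date(2016, 8, 11), 5.0),
    ("b", date(2016, 8, 5), 20.0),
]
print(rfm_features(purchases, as_of=date(2016, 8, 31)))
```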

In [None]:
!python a2_model_training.py \
  --input_dataset ../datasets/model_files/A2_return_data.csv \
  --output_directory ../models/ \
  --output_result_directory ../results/ \
  --output_visualization_directory ../visualizations/

../datasets/model_files/A2_return_data.csv ../models/ ../results/ ../visualizations/
Beginning Logistic Regression
Beginning SVC
Beginning KNN
Beginning Decision Trees
Beginning Random Forests
<Figure size 2000x2000 with 2 Axes>
<Figure size 800x300 with 2 Axes>
<Figure size 640x480 with 0 Axes>
<Figure size 2400x800 with 1 Axes>
Frequency 0.8626647389272503
channelGrouping_Organic Search 0.01817720718509018
Monetary 0.016707180105817145
channelGrouping_Direct 0.012295209751090544
totals.timeOnSite 0.011119207277461903
totals.pageviews 0.010012481758793837
channelGrouping_Referral 0.009129065315048729
Recency 0.008044090914501987
channelGrouping_Paid Search 0.005761877998246554
trafficSource.medium_cpm 0.004599542913452809
trafficSource.medium_referral 0.004566293966799045
hits.hour_ordinal 0.00455522685965397
geoNetwork.country_woe 0.0043395415727116166
device.operatingSystem_Linux 0.004314187243082622
totals.hits 0.003972729623929231
device.operatingSystem_iOS 0.0028149935408655702
t

## Unsupervised Learning (B1 and B2): Model Training and Inference

In [None]:
!python B2_B1_model_training.py

0- importing packages
1- downloading file
2- rebalancing classes
3- scaling df
3- scaling df
4.1. returning spree chart
4.1. returning pca heatmap
4.1. processing and saving pca balanced class
4.1. returning pca heatmap
4.1. processing and saving pca imbalanced data set
4.2- downloading famd df, model and visualizations
5.1- downloading kmeans scores
5.1- downloading kmeans scores
5.1- downloading kmeans scores
5.1- downloading kmeans scores
error with saving DF as image
<Figure size 800x300 with 2 Axes>
<Figure size 1400x400 with 1 Axes>
<Figure size 4000x100 with 2 Axes>
<Figure size 4000x100 with 2 Axes>
<Figure size 1800x400 with 4 Axes>
5.1- returning elbow scores
5.1- downloading elbow scores, silhouette and score tables
error with saving DF as image
5.2- pca returning groupes and pca_transformed
5.2- returning pca/kmeans full heat map
5.2 downloading biplot clustering visualization
5.1- downloading kmeans scores
5.1- downloading kmeans scores
5.1- downloading kmeans scores
5.1- 