# Template for Data Science Party Tutorials

# Preface - Installing Packages 

## Using Jupyter Notebooks 

Jupyter Notebook is an interactive Python environment for data science. Cells are separated into Markdown (i.e., text) cells and code cells. In this notebook, you should not need to edit code (unless you really want to!). You can run each cell by selecting it and pressing "Cmd + Return" ("Ctrl + Enter" on Windows or Linux) or by using the "> Run" button at the top.

Note: A best practice is to import packages in the first cell of the notebook. However, because this is a tutorial, packages will be imported in the first cell in which they are used, to more closely associate each package with its use.

## Installing Packages 
First, we'll install some packages we'll use in the tutorial today. This may produce a lot of output so please be patient. 

In [None]:
# Install libraries if not present. 

# Use magic commands to install libraries (e.g., %pip install <library name>)


# Import libraries into our environment. 
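
One way to fill in the install cell is to check which packages are missing before calling pip. The sketch below uses only the standard library. The helper name `missing_packages` and the package list are illustrative assumptions, not part of the template.

```python
import importlib.util


def missing_packages(import_names):
    """Return the names from import_names that cannot be imported."""
    return [
        name for name in import_names
        if importlib.util.find_spec(name) is None
    ]


# Hypothetical requirements; replace with the packages your tutorial uses.
required = ["pandas", "sklearn"]
to_install = missing_packages(required)
print("Packages to install:", to_install)

# In a notebook cell you could then install what is missing, for example:
#   %pip install pandas scikit-learn
# Note that the import name (sklearn) can differ from the pip name (scikit-learn).
```

Checking first keeps the output short for users who already have everything installed.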


# 1. Import our Data

We're going to use pandas to import and inspect our data.  We will also preview the data to allow users to understand the nature of the variables they are working with. 

In [None]:
# Import pandas 
import pandas as pd 

# Import data here as df 

# Inspect our dataframe
df.head()
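
As an illustration, a filled-in version of this cell might look like the sketch below. The inline CSV and its column names are made up for the example. In a real tutorial you would pass a file path or URL to `pd.read_csv` instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical data standing in for the tutorial's real file.
csv_text = StringIO(
    "review,stars\n"
    "Great product,5\n"
    "Stopped working after a week,1\n"
    "Does the job,4\n"
)

df = pd.read_csv(csv_text)

# Preview the rows, then check the size and column types.
print(df.head())
print(df.shape)
print(df.dtypes)
```

Printing the shape and dtypes alongside `head()` gives users a quick sense of how much data there is and which columns are numeric.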

# 2. Pre-processing of Data

This section should be used for any required preprocessing of the data. If the data needs to be cleaned, have missing values replaced or dropped, or be formatted as a numpy array, that should occur here.

In [None]:
# Import libraries

# Preprocessing Code 

Sometimes it is not clear what a preprocessing function is doing. If so, call the function below so that the user gets some idea of the data transformations occurring. When relevant, include code that demonstrates the data's shape or other key characteristics.

In [None]:
# Import libraries

# Preprocessing Code goes here.
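
A minimal sketch of what such a demonstration could look like is shown below. The DataFrame, its column names, and the choice of median imputation are assumptions made for illustration, so adapt them to the data in your tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with missing values.
raw = pd.DataFrame({
    "age": [34.0, np.nan, 52.0, 41.0],
    "income": [48000.0, 61000.0, np.nan, 75000.0],
    "label": [0, 1, 1, None],
})
print("Before:", raw.shape)

# Drop rows with no label, then fill remaining gaps with the column median.
clean = raw.dropna(subset=["label"])
clean = clean.fillna(clean.median(numeric_only=True))
print("After:", clean.shape)
print("Missing values left:", int(clean.isna().sum().sum()))

# Convert the feature columns to a numpy array for modelling.
X = clean[["age", "income"]].to_numpy()
print("Feature array shape:", X.shape)
```

Printing the shape before and after each step makes it easy for users to see how many rows a transformation removed.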

### Splitting Data 

If the data should have a train/test split (i.e., you are demonstrating a supervised learning technique), split the data here. Note that even if you are demoing an unsupervised learning technique, you may still wish to split the data to demonstrate how the model behaves with unseen data. Delete this section at your own discretion.

In [None]:
from sklearn.model_selection import train_test_split

# Split our data. 
df_train, df_test = train_test_split(df, test_size = 0.1)

### Feature Extraction 
 
To prevent data leakage, you may wish to do feature extraction after the train/test split. Data leakage occurs when information from the test split is present in the training data. For example, if you calculate a metric that uses a variable's mean, you should use the mean from the training set, not the mean from the entire dataset. Doing otherwise uses information from the test dataset. The need for this should be determined on a case-by-case basis for the given technique.

In [None]:
# Import libraries

# Feature extraction code goes here. 
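
The mean example above can be sketched as follows. The tiny DataFrames and the `price` column are hypothetical. In the tutorial, `df_train` and `df_test` would come from the split in the previous section.

```python
import pandas as pd

# Hypothetical splits; in the tutorial these come from train_test_split.
df_train = pd.DataFrame({"price": [10.0, 20.0, 30.0]})
df_test = pd.DataFrame({"price": [100.0]})

# Compute the statistic from the training split only.
train_mean = df_train["price"].mean()

# Apply the same training-derived value to both splits.
df_train["price_centered"] = df_train["price"] - train_mean
df_test["price_centered"] = df_test["price"] - train_mean

print(train_mean)
print(df_test)
```

Had the mean been taken over all four rows, it would have been 40.0 rather than 20.0, a value influenced by the test row.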

# 3.  Model Creation:  < MODEL NAME >

Describe in the simplest terms possible: 
   - What the model is doing
   - What assumptions the model makes
   - Any requirements for the data 
    
Spell out any common abbreviations when introduced.  For example: 
   - Stochastic Gradient Descent (SGD) 
   - Non-Negative Matrix Factorization (NMF) 
   - Recurrent Neural Network (RNN) 
   
Note that some models may not be trainable on the issued laptops or may take too long to train. In that case, include the code showing how the model would be trained, and then use the functions in the appendix to import the trained model (e.g., its weights). In other words, even if we can't train the model live, we can still get the model into memory to demonstrate its capabilities.

In [None]:
%%time
# We're going to time our training to know if it's feasible for a demo.  Shoot
# for 5 minutes but no more than 10.  You can assume that most Concord-issued
# laptops have similar computing specifications.
# Note: %%time is a cell magic, so it must be the first line of the cell.


# Fit model 
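
The `%%time` magic only works inside IPython or Jupyter. If you want to check the training time from a plain Python script, a rough equivalent is sketched below. The `timed` helper and the `train_model` stand-in are hypothetical names used only for this example.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def train_model():
    # Hypothetical stand-in for a real call such as model.fit(X_train, y_train).
    return sum(i * i for i in range(100_000))


_, seconds = timed(train_model)
print(f"Training took {seconds:.2f} s")

# Demo budget from the template: aim for 5 minutes, never more than 10.
if seconds > 10 * 60:
    print("Too slow for a live demo; load a saved model instead.")
```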


# 4.  Model Inspection 

Provide a demonstration of what the model has actually done. This may mean demonstrating model outputs or predictions. The goal here is to show the practical import of the model: what has it allowed us to do?

In [None]:
# Model inspection code.  

# 5.  Assessing Unseen Data 

Demonstrate how the model handles unseen data. With scikit-learn, this may involve calling `predict` or `transform` on the unseen data. How does the model behave or perform?

In [None]:
# code to assess unseen data. 
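
As a sketch of what this cell could contain, the example below fits a scikit-learn classifier and evaluates it on held-out rows. The synthetic dataset and the choice of `LogisticRegression` are assumptions made for illustration, so substitute the model and data from your own tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the tutorial's dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Fit on the training split only.
model = LogisticRegression().fit(X_train, y_train)

# The model has never seen X_test, so this approximates real-world behaviour.
predictions = model.predict(X_test)
print("Predictions:", predictions[:5])
print("Accuracy on unseen data:", model.score(X_test, y_test))
```

Fixing `random_state` keeps the split and the results the same each time the notebook is run, which helps during a live demo.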

# 6. Visualization of Results 

How could we visualize results? Provide at least one simple method that allows for quick visual insight into the data.  You may wish to mention other, non-implemented, techniques that could be used for visualization or the types of questions you could ask of the data/model that would make for good visualizations.  

In [None]:
# Code to visualize data here.

# 7. Conclusion

Provide: 

1. A recap of what they learned. 
2. Resources to follow up for learning more. 
3. Encouragement to use the code for their own work. 


Please feel free to use the above code for your own projects.  


# Appendix I - Model taking too long to train? 

Note that GitHub has a file size limit, so it is often not possible to put all objects in a single dictionary, pickle it, and push it to GitHub. Therefore, it's recommended that you save each object to a separate pickle file in the saved_model folder. The cell below allows a user to read the files back in.

**Copy for this cell:** 

Having trouble getting the model to run in the time allotted for the tutorial? Fortunately, Python has a module, "pickle", that allows for the storage of objects. I've written a version of the model featured in this notebook to file so that you can read it in to finish our exercise.

 

In [None]:
import pickle
import os

# specify the file names.  
files = [
]

# Set file path. 
path = os.getcwd()+"/saved_model/"

model_data = {}

# Read in pickle files. 
for f in files: 
    with open(path+f, 'rb') as file:
        model_data[f] = pickle.load(file)

# You may have to read in pandas dataframes using native methods. 
# Example: pd.read_csv(path+'dataframe.csv')     

# Assign objects as they are assigned in the tutorial above so that the user can continue without training
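
To make the last step concrete, the sketch below saves two objects to separate pickle files, reads them back, and then re-creates the variable names a notebook might use. The file names and objects are hypothetical, and a temporary folder stands in for saved_model so the example can run anywhere.

```python
import os
import pickle
import tempfile

# Hypothetical objects standing in for a trained model and its vocabulary.
objects = {"model.pkl": {"weights": [0.1, 0.2]}, "vocab.pkl": ["cat", "dog"]}

with tempfile.TemporaryDirectory() as path:
    # Save each object to its own pickle file.
    for name, obj in objects.items():
        with open(os.path.join(path, name), "wb") as file:
            pickle.dump(obj, file)

    # Read the files back into a dictionary keyed by file name.
    model_data = {}
    for name in objects:
        with open(os.path.join(path, name), "rb") as file:
            model_data[name] = pickle.load(file)

# Re-create the variable names used earlier in the notebook.
model = model_data["model.pkl"]
vocab = model_data["vocab.pkl"]
print(model, vocab)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.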

# Appendix II - Want to save your model? 

Have you tweaked the above script and want to save your own model to file? Run the cell below. 

In [None]:
import pickle 
import os 

# Set file path. 
path = os.getcwd()+"/saved_model/"
if not os.path.isdir(path): 
    os.mkdir(path)

# Use built in methods to save dataframes if possible. 
# Example: df_train.to_csv(path+'df_train.csv') 


# Specify objects to save to file as a dictionary.  The key will be used as the file name while 
# the value is the object.
files = {
}

# Write objects to file. 
for k,v in files.items(): 
    with open(path+k, 'wb') as file:
        pickle.dump(v, file)
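
Before pushing the saved files, it can help to check that none of them exceeds the hosting limit. The sketch below assumes a 100 MiB per-file limit on GitHub, which you should confirm against the current GitHub documentation. The helper name `oversized_files` is hypothetical.

```python
import os
import tempfile

# Assumed limit: GitHub rejects files over 100 MiB; confirm against current docs.
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(folder, limit=GITHUB_LIMIT_BYTES):
    """Return (name, size_in_bytes) for files in folder larger than limit."""
    results = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if os.path.isfile(full_path) and os.path.getsize(full_path) > limit:
            results.append((name, os.path.getsize(full_path)))
    return results


# Demonstrate on a temporary folder with a tiny limit so the check triggers.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "small.pkl"), "wb") as f:
        f.write(b"x" * 10)
    with open(os.path.join(folder, "large.pkl"), "wb") as f:
        f.write(b"x" * 1000)
    demo_result = oversized_files(folder, limit=100)
    print(demo_result)
```

In a tutorial you would call `oversized_files(path)` on the saved_model folder and split or compress anything it reports.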
        



# Appendix III - Getting Code onto People's Computers


The proper way to get this code onto a user's computer is with git. However, we want to ensure that people who do not have git installed, or are not familiar with it, can easily download the tutorial, hence the use of built-in Python.

In [None]:
import os
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

zipurl = 'https://github.com/team-evolytics/data_science_party_nlp_tutorial/archive/refs/heads/main.zip'

with urlopen(zipurl) as zipresp:
    with ZipFile(BytesIO(zipresp.read())) as zfile:
        zfile.extractall(os.getcwd())
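
To see what the extraction step does without touching the network, the sketch below builds a small zip archive in memory and unpacks it into a temporary folder. The archive contents and the `tutorial-main` folder name are made up for the example. GitHub archives typically unpack into a single top-level folder in a similar way.

```python
import os
import tempfile
from io import BytesIO
from zipfile import ZipFile

# Build a small zip archive in memory to stand in for the downloaded bytes.
buffer = BytesIO()
with ZipFile(buffer, "w") as zfile:
    zfile.writestr("tutorial-main/README.md", "# Hello")
    zfile.writestr("tutorial-main/notebook.ipynb", "{}")
zip_bytes = buffer.getvalue()

# Extract into a temporary folder and list what was written.
with tempfile.TemporaryDirectory() as target:
    with ZipFile(BytesIO(zip_bytes)) as zfile:
        zfile.extractall(target)
    extracted = sorted(os.listdir(os.path.join(target, "tutorial-main")))

print(extracted)
```

After running the real download cell, users should therefore look for a new folder in their working directory rather than loose files.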