<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Research-and-Development-(RND)" data-toc-modified-id="Research-and-Development-(RND)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Research and Development (RND)</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Basic-Stats" data-toc-modified-id="Basic-Stats-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Basic Stats</a></span></li><li><span><a href="#Correlations" data-toc-modified-id="Correlations-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Correlations</a></span></li><li><span><a href="#Regressions" data-toc-modified-id="Regressions-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Regressions</a></span></li></ul></li></ul></div>

# Research and Development (RND)

This notebook captures the basic steps of most initial research and development (RND) work that precedes the scripting, modularization, and packaging of production-ready code. In the [Domino Data Lab Data Science Lifecycle](https://www.dominodatalab.com/resources/field-guide/managing-data-science-projects/) (a personal favorite of mine), RND aims to generate the insights the business needs to make decisions.


![img](../assets/dsci-lifecycle-rnd.png)

## Imports

In [1]:
import fuzzywuzzy
import glob
import humanize
import itertools
import json
import missingno as msno
import os
import pickle
import random
import re
import statistics
import geopandas as gpd
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import recordlinkage
import requests
import scipy as sp
import scipy.stats  # load the submodule explicitly so sp.stats is always available
import seaborn as sns
from dotenv import load_dotenv
from pandas_profiling import ProfileReport
from string import punctuation

%matplotlib inline
import matplotlib as mpl
import matplotlib.font_manager
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
from matplotlib.ticker import PercentFormatter

# dot env for secrets
load_dotenv()
some_apikey = os.getenv("SOME_KEY")

# mapbox
TOKEN = os.getenv("MAPBOX_TOKEN")
px.set_mapbox_access_token(TOKEN)
MAPBOX_STYLE = "dark"
MAPBOX_HEIGHT = 800

# matplotlib configs
matplotlib.font_manager.findSystemFonts(fontpaths=None, fontext="ttf")
plt.style.use("seaborn-colorblind")  # named "seaborn-v0_8-colorblind" in matplotlib >= 3.6
plt.rcParams["font.family"] = "sans-serif"
plt.rcParams["font.sans-serif"] = "Open Sans"
plt.rcParams["figure.figsize"] = (15, 6)

# watermark
%reload_ext watermark
%watermark -a 'Ken Cavagnolo' -n -u -v -m -h -g -p jupyter,notebook,pandas,numpy,scipy

Author: Ken Cavagnolo

Last updated: Mon Aug 09 2021

Python implementation: CPython
Python version       : 3.8.0
IPython version      : 7.25.0

jupyter : 1.0.0
notebook: 6.4.0
pandas  : 1.3.0
numpy   : 1.21.0
scipy   : 1.7.0

Compiler    : GCC 10.3.0
OS          : Linux
Release     : 5.11.0-7620-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

Hostname: goldfinch

Git hash: 827800850bf3c63cbde3dbf16410107476f840b0



## Load Data

In [None]:
df = pd.read_hdf("clean-data.h5", "data")

## Basic Stats

In [None]:
# column of interest
data_col = "col_a"

In [None]:
# create a model Gaussian CDF
mean, std = df[data_col].mean(), df[data_col].std()
dist = sp.stats.norm(mean, std)

# evaluate the model CDF
xs = np.linspace(df[data_col].min(), df[data_col].max())
ys = dist.cdf(xs)

In [None]:
# plot the model CDF
fig, ax = plt.subplots()
ax.plot(xs, ys, color='gray')

# ECDF on the same axes (data on x, cumulative proportion on y, matching the model)
sns.ecdfplot(data=df, x=data_col, ax=ax)
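
To make explicit what the ECDF plot draws, the empirical CDF can be computed by hand: sort the values and assign each one the fraction of observations at or below it. The sketch below runs on synthetic data; the sample `values`, the seed, and the helper name `ecdf` are illustrative assumptions and not part of this notebook's dataset.

```python
import numpy as np
from scipy import stats


def ecdf(values):
    """Return the sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Synthetic stand-in for df[data_col] (assumption: roughly Gaussian data)
rng = np.random.default_rng(42)
values = rng.normal(loc=10.0, scale=2.0, size=1_000)

x, y = ecdf(values)

# Model CDF built from the sample mean and sample std (ddof=1, as pandas uses)
model = stats.norm(values.mean(), values.std(ddof=1))

# Largest vertical gap between the ECDF and the model CDF
max_gap = np.max(np.abs(y - model.cdf(x)))
print(f"largest gap between ECDF and model CDF: {max_gap:.3f}")
```

A small gap suggests the Gaussian model is a reasonable description of the column; a large one is a cue to look at the tails or try another distribution.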

In [None]:
# PMF == probability of each value of a discrete random variable
probabilities = df[data_col].value_counts(normalize=True)
sns.barplot(x=probabilities.index, y=probabilities.values)
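
As a quick sanity check on the PMF step, here is a minimal sketch with a tiny hand-made series. The name `rolls` and its values are hypothetical; the point is only that `value_counts(normalize=True)` divides each count by the total, so the resulting probabilities sum to one.

```python
import pandas as pd

# Hypothetical discrete column standing in for df[data_col]
rolls = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

# normalize=True turns counts into relative frequencies, i.e. P(X = value)
pmf = rolls.value_counts(normalize=True).sort_index()
print(pmf)

# A valid PMF sums to 1 (up to floating-point error)
total = pmf.sum()
print(f"sum of probabilities: {total:.6f}")
```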

In [None]:
# create a model Gaussian PDF
ys = dist.pdf(xs)

In [None]:
# plot the model PDF
fig, ax = plt.subplots()
ax.plot(xs, ys, color='gray')

# histogram normalized to unit area, so it is comparable to the model PDF
sns.histplot(data=df, x=data_col, stat="density", ax=ax)

# KDE == smoothed estimate of the PDF of a continuous random variable
sns.kdeplot(data=df, x=data_col, ax=ax)

## Correlations

In [None]:
# DON'T BE DUPED! .corr() defaults to Pearson, which measures linear association only
corr_cols = ["col_a", "col_b", "col_c"]
corr = df[corr_cols].corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots()

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and a square aspect ratio
# (vmax=.3 saturates the color scale for correlations above 0.3)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, ax=ax,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
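
The warning about linear relationships is easy to demonstrate. In the sketch below, all data is synthetic and the column names `parabola` and `exponential` are made up for illustration: one column depends on `x` perfectly but non-monotonically, the other monotonically but non-linearly. Pearson should understate both, while a rank-based measure such as Spearman (available through `method="spearman"`) recovers the monotonic one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2_000)

demo = pd.DataFrame({
    "x": x,
    "parabola": x**2,               # fully determined by x, but not linear
    "exponential": np.exp(5 * x),   # monotonic in x, but not linear
})

pearson = demo.corr(method="pearson")
spearman = demo.corr(method="spearman")

print("Pearson  x vs parabola:   ", round(pearson.loc["x", "parabola"], 3))
print("Pearson  x vs exponential:", round(pearson.loc["x", "exponential"], 3))
print("Spearman x vs exponential:", round(spearman.loc["x", "exponential"], 3))
```

Neither coefficient detects the parabola, so a scatter plot of each pair remains the safest first check.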

## Regressions

In [None]:
# fit a linear regression model
xs = df.col_a
ys = df.col_b
fit = sp.stats.linregress(xs, ys)
print(fit)

ax = sns.regplot(data=df, x="col_a", y="col_b", x_jitter=.1)
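
The object returned by `linregress` carries more than the printed summary suggests. As a sketch on synthetic data (the true slope of 2, intercept of 1, and noise level are assumptions chosen for illustration), the fields can be unpacked and checked like this:

```python
import numpy as np
from scipy import stats

# Synthetic data with a known linear relationship plus Gaussian noise
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=500)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

fit = stats.linregress(x, y)

# rvalue is the Pearson correlation; its square is the coefficient of determination
r_squared = fit.rvalue**2

# Residuals of an ordinary least-squares fit average to zero
residuals = y - (fit.intercept + fit.slope * x)

print(f"slope     = {fit.slope:.3f} +/- {fit.stderr:.3f}")
print(f"intercept = {fit.intercept:.3f}")
print(f"R^2       = {r_squared:.3f}")
print(f"p-value   = {fit.pvalue:.3g}")
```

Note that `linregress` does not drop missing values, so on real data it is worth calling `dropna` on the two columns first.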