# A toolkit of handy Python snippets for analysing data

## Placeholder for an index ##

### Basic useful exploratory stuff ###

#### <font color=red>**Finding and eliminating errors**</font>

The interactive debugger is also a magic function, but I have given it a category of its own. If a code cell raises an exception, type %debug in a new cell and run it. This opens an interactive debugging session at the point where the exception occurred. There you can inspect the values of variables assigned in the program and evaluate expressions. To exit the debugger, enter q.

#### <font color=red>**Changing markdown text boxes**</font>

<div class="alert alert-block alert-info">Text goes here.</div>

#### <font color=red>**Calculating timestamp differences**</font>

In [None]:
df.timestamp.max(), df.timestamp.min()
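
The cell above only returns the latest and earliest timestamps. A minimal sketch of turning them into an actual difference, assuming the column holds datetimes (the sample values below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; in practice df is your own DataFrame
df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2017-01-02 13:42:05", "2017-01-24 13:41:54", "2017-01-10 08:00:00"])})

# Subtracting two Timestamps gives a Timedelta
duration = df.timestamp.max() - df.timestamp.min()
print(duration)
```

If the column is stored as text, converting it with pd.to_datetime first is needed before the subtraction works.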

#### <font color=red>**Counting how often each record type occurs in a column**</font>

In [None]:
df.action.value_counts()
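
value_counts() returns how often each value occurs. If only the number of distinct values is needed, nunique() gives that directly. A small sketch with made-up action values:

```python
import pandas as pd

# Hypothetical sample data standing in for the real df
df = pd.DataFrame({"action": ["view", "click", "view", "view", "click", "buy"]})

counts = df.action.value_counts()   # rows per distinct value, most frequent first
n_types = df.action.nunique()       # number of distinct values

print(counts)
print(n_types)
```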

#### <font color=red>**Commenting out code automatically**</font>

Ctrl/Cmd + / comments out the selected lines in the cell automatically. Hitting the combination again uncomments the same lines.

### Data wrangling ###

#### <font color=red>**Using scikit-learn to impute missing values**</font>

This comes from a Medium article by Maurizio Sluijmers accessible here: https://levelup.gitconnected.com/scikit-learn-python-6-useful-tricks-for-data-scientists-1a0a502a6aa3

In [None]:
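# Simple imputer: by default, replaces each NaN with its column mean
from sklearn.impute import SimpleImputer, KNNImputer
import pandas as pd

# X is assumed to be an existing numeric DataFrame
X.iloc[1, 2] = float('NaN')  # the article sets one value to NaN as a demonstration

imputer_simple = SimpleImputer()
pd.DataFrame(imputer_simple.fit_transform(X))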

In [None]:
#Using KNN imputer to consider the 2 nearest neighbors & weight them uniformly

imputer_KNN = KNNImputer(n_neighbors=2, weights="uniform")
pd.DataFrame(imputer_KNN.fit_transform(X))
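
The two cells above rely on an X that is defined elsewhere. As a self-contained sketch, here is the same pair of imputers on a tiny made-up DataFrame (the column names a and b and all values are assumptions for illustration), which shows how the two strategies can fill the same gap differently:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data with one missing value in column "a"
X = pd.DataFrame({"a": [1.0, np.nan, 7.0, 3.0],
                  "b": [2.0, 3.0, 6.0, 4.0]})

# Mean imputation: the NaN becomes the mean of the observed values in "a"
simple_filled = pd.DataFrame(SimpleImputer().fit_transform(X), columns=X.columns)

# KNN imputation: the NaN becomes the mean of "a" over the 2 rows closest in "b"
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X),
    columns=X.columns)

print(simple_filled)
print(knn_filled)
```

Here the mean imputer is pulled upwards by the large value 7.0, while the KNN imputer only averages the two rows whose b values are closest to the incomplete row.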

#### <font color=red>**Using pandas profiling as an alternative to df.describe**</font>

This comes from a Medium article by Lukas Frei accessible here: https://towardsdatascience.com/speed-up-your-exploratory-data-analysis-with-pandas-profiling-88b33dc53625

In [None]:
import pandas_profiling  # newer releases of this package are published as ydata-profiling

pandas_profiling.ProfileReport(df)

#### <font color=red>**Using pandas read_html to scrape tables from web pages**</font>

In [None]:
import pandas as pd

tables = pd.read_html("https://apps.sandiego.gov/sdfiredispatch/")

print(tables[0])

#### <font color=red>**Cleaning data in Pandas**</font>

In [None]:
# where Age contains non-numeric entries such as '-', errors='coerce' turns them into NaN
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

In [None]:
# converting data types
df['Name'] = df['Name'].astype('string')
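
A minimal, self-contained sketch of both cleaning steps on made-up data (the names and ages are assumptions for illustration):

```python
import pandas as pd

# Hypothetical raw data: ages arrive as text, with '-' marking a missing value
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": ["34", "-", "27"]})

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")  # '-' becomes NaN
df["Name"] = df["Name"].astype("string")               # pandas string dtype

print(df.dtypes)
print(df)
```

Because NaN is a float, the Age column ends up as a float column rather than an integer one.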

#### <font color=red>**Clipping outliers**</font>

In [None]:
data['price'] = data['price'].clip(100,125)
data.head()
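
To make the effect of clip visible, here is a small sketch with made-up prices: values below the lower bound are raised to it, values above the upper bound are lowered to it, and everything in between is left alone.

```python
import pandas as pd

# Hypothetical prices, two of them outside the 100 to 125 range
data = pd.DataFrame({"price": [90, 100, 110, 125, 300]})

data["price"] = data["price"].clip(100, 125)
print(data)
```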

### 2. Mangling dataframe queries

#### <font color=red>Checking whether conditions in two columns match:</font>

In [None]:
jud_data[jud_data['actual']!=jud_data['predicted']].count()/jud_data.count()

#### <font color=red>Working out a percentage e.g. type 1 errors</font>

In [None]:
jud_data.query('actual=="innocent" and predicted=="guilty"').shape[0]/ jud_data.shape[0]
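
The two calculations above can be tried on a small made-up jud_data (the five rows below are assumptions for illustration). Taking the mean of a boolean comparison is an alternative way of getting the mismatch rate as a single number rather than one value per column:

```python
import pandas as pd

# Hypothetical verdicts: what actually happened vs. what was predicted
jud_data = pd.DataFrame({
    "actual":    ["innocent", "guilty", "innocent", "guilty", "innocent"],
    "predicted": ["innocent", "guilty", "guilty", "innocent", "guilty"],
})

# Share of rows where the two columns disagree
mismatch_rate = (jud_data["actual"] != jud_data["predicted"]).mean()

# Share of rows that are type 1 errors: innocent but predicted guilty
type1_rate = (jud_data.query('actual=="innocent" and predicted=="guilty"').shape[0]
              / jud_data.shape[0])

print(mismatch_rate, type1_rate)
```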

#### <font color=red>bamboolib - a handy little library!</font>

In [None]:
import bamboolib as bam
import pandas as pd
df = pd.read_csv(bam.titanic_csv)
df

#### <font color=red>Setting up a new column that combines treatment & control in ab_page</font>

In [None]:
df2['ab_page'] = pd.get_dummies(df2['group'])['treatment']
df2.head()

#### <font color=red>Creating two new columns from data in country</font>

In [None]:
dfN[['CA', 'US']]=pd.get_dummies(dfN['country'])[['CA', 'US']]
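
A self-contained sketch of both dummy-column tricks, using one made-up DataFrame for brevity (the rows are assumptions for illustration). Depending on the pandas version, get_dummies returns booleans or 0/1 integers, so astype(int) is added here to make the result the same either way:

```python
import pandas as pd

# Hypothetical experiment data
df2 = pd.DataFrame({"group": ["control", "treatment", "treatment", "control"],
                    "country": ["US", "CA", "UK", "US"]})

# 1 where group is "treatment", 0 where it is "control"
df2["ab_page"] = pd.get_dummies(df2["group"])["treatment"].astype(int)

# One indicator column per selected country; UK rows are 0 in both
df2[["CA", "US"]] = pd.get_dummies(df2["country"])[["CA", "US"]].astype(int)

print(df2)
```

Leaving one category out (UK here) makes it the baseline, which is the usual setup when the columns feed a regression.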

#### <font color=red>Creating a new column that calculates conversion rate for US</font>

### 3. Correlations etc

#### <font color=red>Using a predictive power score</font>
https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598

In [None]:
%pip install ppscore

import ppscore as pps
import seaborn as sns

# score for a single feature/target pair
pps.score(df, "feature_column", "target_column")

# for the whole pps matrix
pps.matrix(df)

# for visualisation (ppscore 1.0 and later return a long-format table, so pivot it first)
df_matrix = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(df_matrix, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

# to plot the predictors of a target column, here named "y"
df_predictors = pps.predictors(df, y="y")
sns.barplot(data=df_predictors, x="x", y="ppscore")

### Pandas plot function ###

Pandas has a built-in .plot() method on the DataFrame class. It has several key parameters:

kind: 'bar', 'barh', 'pie', 'scatter', 'kde' etc.; the full list can be found in the docs.

color: accepts an array of hex codes, applied in sequence to each data series / column.

linestyle: 'solid', 'dotted', 'dashed' (applies to line graphs only)

xlim, ylim: a tuple (lower limit, upper limit) giving the axis range over which the plot is drawn

legend: a boolean value to display or hide the legend

labels: a list with one entry per column in the dataframe, giving a descriptive name for each to show in the legend

title: the string title of the plot
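
A minimal sketch combining several of these parameters, assuming matplotlib is installed (pandas uses it for plotting). The sales figures and region names are made up for illustration:

```python
import pandas as pd

# Hypothetical monthly sales for two regions
sales = pd.DataFrame({"north": [3, 5, 4], "south": [2, 6, 7]},
                     index=["Jan", "Feb", "Mar"])

# One hex colour per column, a fixed y-axis range, a legend and a title
ax = sales.plot(kind="bar",
                color=["#1f77b4", "#ff7f0e"],
                ylim=(0, 10),
                legend=True,
                title="Monthly sales by region")
```

.plot() returns a matplotlib Axes object, so anything not covered by the parameters above can still be adjusted afterwards through ax.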

### Working with JSON files ###

#### <font color=red>Reading JSON data (saved from an API to a .txt file) into a df</font>

In [None]:
# Write JSON data to a dataframe
import json
import pandas as pd
import tweepy # for working with twitter

df_list=[]
with open('tweet_json.txt', 'r') as jfile:
    for item in jfile:
        json_data = json.loads(item)
        df_list.append({'tweet_id':json_data['id'], # 'id', 'favorite_count', 'full_text' etc. are all fields from the API
                        'favorites':json_data['favorite_count'], # NB! the append sits inside the for loop
                        'retweet':json_data['retweet_count'],
                        'text':json_data['full_text']})

df_twit=pd.DataFrame(df_list,columns=['tweet_id','favorites','retweet','text'])
df_twit.head(5)
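
The loop above assumes that tweet_json.txt holds one JSON object per line. A self-contained sketch of the same pattern, with an in-memory file and two made-up records standing in for the real data:

```python
import io
import json
import pandas as pd

# Two hypothetical records standing in for the lines of tweet_json.txt
jfile = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2, "full_text": "first"}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 0, "full_text": "second"}\n'
)

rows = []
for line in jfile:
    record = json.loads(line)  # one JSON object per line
    rows.append({"tweet_id": record["id"],
                 "favorites": record["favorite_count"],
                 "retweet": record["retweet_count"],
                 "text": record["full_text"]})

df_twit = pd.DataFrame(rows, columns=["tweet_id", "favorites", "retweet", "text"])
print(df_twit)
```

Collecting plain dicts in a list and building the DataFrame once at the end is generally faster than appending to a DataFrame inside the loop.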

### Accessing Twitter from API ###

In [None]:
import tweepy

consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
access_token = 'YOUR ACCESS TOKEN'
access_secret = 'YOUR ACCESS SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)