# Intro/Recap to Data Analysis in Python

## Objectives:
Learn or refresh the following skills:
- Data manipulation
  - Statistics packages
  - File management
  - Pandas
- Data visualization
- Regular expressions  

I believe that one of the most useful skills a data scientist can have is "Google-Fu", the ability to use search engines or search through code documentation to find answers to problems. As such, I will not be giving a complete tutorial here so much as a brief introduction to data analysis tools in Python. I will provide several examples of useful tools and how to use them along with brief notes to clarify some of the more confusing points. I highly recommend searching the documentation for all of the tools we discuss on your own to build your familiarity with the packages as well as to hone your "Google-Fu". Samir and I are more than happy to help with any questions you may have during this process.

## Data Manipulation

### Statistics Packages

We don't have to reinvent the wheel with every project. Using standard packages to do data analysis is usually a much better option than writing data analysis code from scratch. Not only is their code more thoroughly tested, and thus more reliable, it is also usually faster to execute than code written from scratch because it relies on faster compiled languages like C on the backend.  

#### NumPy

NumPy is one of the most widely used packages in the Python community, and much of the scientific Python ecosystem (including SciPy and Pandas) is built on top of it. NumPy is based around the idea of arrays and array operations. A NumPy array is like a more structured, possibly multi-dimensional version of a list in which every element shares one data type. While not quite as flexible as a list, an array allows for faster and more powerful numerical operations than lists.  
For more tutorials on how to use NumPy, visit this link: https://numpy.org/learn/
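
To make the difference between lists and arrays concrete, here is a minimal sketch. The variable names are made up for this illustration; it simply applies the same operators to a list and to an array built from it.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

# Multiplying a list repeats it; multiplying an array scales each element
doubled_list = values_list * 2
doubled_array = values_array * 2

# Adding two arrays of the same shape adds them element by element
summed_array = values_array + values_array

print(doubled_list)
print(doubled_array)
print(summed_array)
```

This element-by-element behavior is what makes arrays convenient for numerical work.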

##### Arrays and Operations

In [None]:
# We start with import statements
import numpy as np    # np is the standard alias for numpy
import math           # the math module is for basic mathematical operations in Python but is usually made obsolete by NumPy

# Taking the arithmetic mean of a list of numbers
number_list = [8, 6, 7, 5, 3, 0, 9]
list_avg = sum(number_list)/len(number_list)
print("Average for list:\t\t\t\t", list_avg)

# Taking the arithmetic mean of an array of numbers
number_array = np.array([8, 6, 7, 5, 3, 0, 9])
array_avg = number_array.mean()
print("Average for NumPy:\t\t\t\t", array_avg)

# Alternate way to use NumPy to take the arithmetic mean of a list of numbers
np_list_mean = np.mean(number_list)
print("Average for list using NumPy:\t\t\t", np_list_mean)

# Taking the standard deviation of a list of numbers
list_stdev = math.sqrt(sum([(value - list_avg)**2 for value in number_list])/len(number_list))
print("Standard deviation for list:\t\t\t", list_stdev)

# Taking the standard deviation of an array of numbers
array_stdev = number_array.std()
print("Standard deviation for NumPy:\t\t\t", array_stdev)

# Alternate way to use NumPy to take the standard deviation of a list of numbers
np_list_stdev = np.std(number_list)
print("Standard deviation for NumPy using list:\t", np_list_stdev)
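
One detail worth knowing about the cell above: the manual formula and `np.std()` both divide by N, which gives the population standard deviation. The sketch below is an illustrative addition using the same list; it shows how the `ddof` argument switches to the sample standard deviation (dividing by N - 1) and compares the result against Python's built-in `statistics` module.

```python
import statistics

import numpy as np

number_list = [8, 6, 7, 5, 3, 0, 9]

population_std = np.std(number_list)        # divides by N (ddof=0 is the default)
sample_std = np.std(number_list, ddof=1)    # divides by N - 1

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
print("statistics.stdev (sample):", statistics.stdev(number_list))
```

Which version is appropriate depends on whether your data is the whole population or a sample drawn from it.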

##### Multi-dimensional Arrays

In [None]:
# Multi-dimensional arrays
array_2d = np.array([[1,2,3], [4,5,6], [7,8,9]])    # a 3x3 array of integers
array_3d = np.array([[[1,2,3],[4,5,6],[7,8,9]],     # a 3x3x3 array of integers
                     [[10,11,12],[13,14,15],[16,17,18]], 
                     [[19,20,21],[22,23,24],[25,26,27]]])
print(array_2d)
print("\n")
print(array_3d)

# We can perform operations along different axes
mean_2d_rows = np.mean(array_2d, axis=1)            # mean along 2D rows
print("\nMean of 2D rows:\n", mean_2d_rows)
mean_2d_cols = np.mean(array_2d, axis=0)            # mean along 2D columns
print("Mean of 2D columns:\n", mean_2d_cols)
mean_3d_rows = np.mean(array_3d, axis=2)            # mean along 3D rows
print("\nMean of 3D rows:\n", mean_3d_rows)
mean_3d_cols = np.mean(array_3d, axis=1)            # mean along 3D columns
print("Mean of 3D columns:\n", mean_3d_cols)
mean_3d_other = np.mean(array_3d, axis=0)           # mean along other 3D axis
print("Mean of other 3D axis:\n", mean_3d_other)

##### Array Slicing

In [None]:
# We can select slices of arrays to operate on only part of them
array_2d = np.array([[1,2,3,4], [4,5,6,7], [7,8,9,10], [10,11,12,13]])    # a 4x4 array of integers
array_3d = np.array([[[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13]],      # a 3x4x4 array of integers (three 4x4 blocks)
                     [[10,11,12,13],[13,14,15,16],[16,17,18,19],[19,20,21,22]], 
                     [[19,20,21,22],[22,23,24,25],[25,26,27,28],[27,28,29,30]]])
print("Original 2D array:\n", array_2d)

# Use a colon to mean "everything up to", "everything past", "everything between", or just "everything"
slice_2d = array_2d[:2,1]                        # Every row up to, but not including, row 2 (the 3rd row); and just column 1 (the 2nd column)
print("\nRows up to 2, column 1:\n", slice_2d)   # Note that this will be in a 1D format, not a 2D format
slice_2d = slice_2d.reshape(-1,1)                # Reshape the previous slice into a 2D format
print("Rows up to 2, column 1:\n", slice_2d)
slice_2d = array_2d[1:3,1:]                      # Rows between 1 (inclusive) and 3 (exclusive); and all columns starting with 1
print("Rows between 1 and 3, columns 1 and up:\n", slice_2d)
slice_2d = array_2d[:,-2:]                      # All rows; and the last 2 columns (everything past -2)
print("All rows, last two columns:\n", slice_2d)

# 3D slice example
print("\nOriginal 3D array:\n", array_3d)
slice_3d = array_3d[:2,1:,-3:]
print("\nThe first two of the other axis, all rows from 1 and up, the last 3 columns:\n", slice_3d)
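
A point that often surprises newcomers is that a basic NumPy slice is a view of the original array rather than a separate copy. The following is a small sketch with made-up variable names showing the effect and how `.copy()` avoids it.

```python
import numpy as np

original = np.array([[1, 2, 3], [4, 5, 6]])

view = original[:, 1]    # basic slicing returns a view of the same data
view[0] = 99             # so this also changes `original`

independent = original[:, 1].copy()    # an explicit copy owns its own data
independent[1] = -1                    # `original` is not affected by this

print(original)
print(independent)
```

If you plan to modify a slice without touching the source array, take a copy first.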

#### SciPy

SciPy is a NumPy-based library for scientific and technical computing. SciPy contains far more tools than most data scientists will ever use, so we will only cover the most relevant ones here.  
For more documentation on using SciPy, check out this page: https://docs.scipy.org/doc/scipy/tutorial/

##### Stats

In [None]:
# As always, we start with an import statement
import scipy.stats as stats         # If you only need one tool from the stats module, you can import just that tool instead

# Linear regression                            # Docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
X = np.arange(4,25)                            # Get the x-coordinates between 4 and 24 (chosen arbitrarily)
Y = 4*X + 2 + np.random.normal(size=len(X))    # Calculate the y-coordinates for a noisy line following y = 4x + 2
results = stats.linregress(X, Y)               # Perform the linear regression
print("Intercept is {}. This should be close to 2.".format(results.intercept))
print("Slope is {}. This should be close to 4.".format(results.slope))

# Descriptive statistics of an array, good for quickly assessing a large dataset
print("\nDescriptive statistics of an array:")
print(stats.describe(Y))
print("The mean calculated with NumPy is {}, which matches the SciPy value of {}.".format(np.mean(Y), stats.describe(Y).mean))
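
The object returned by `linregress` carries more than the slope and intercept. As a hedged illustration, the sketch below fits a noise-free line (chosen so the answer is known in advance), reads the correlation coefficient from the `rvalue` attribute, and uses the fit to predict a new point.

```python
import numpy as np
from scipy import stats

x = np.arange(10)
y = 3.0 * x + 1.0    # a noise-free line, so the fit should be essentially exact

fit = stats.linregress(x, y)
r_squared = fit.rvalue**2                     # coefficient of determination
predicted = fit.intercept + fit.slope * 12    # predicted y at x = 12

print("Slope:", fit.slope, "Intercept:", fit.intercept)
print("R squared:", r_squared)
print("Predicted y at x = 12:", predicted)
```

With noisy data such as the example above it, R squared will fall below 1.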

##### Optimization

In [None]:
# Here is the import statement for a SciPy optimization tool to find the minimum of a function
from scipy.optimize import minimize                        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html

# Create an arbitrary lambda function of three variables (a lambda is essentially a quickly defined Python function)
f = lambda x: x[0]**2 + np.abs(np.exp(x[1])-1) - np.cos(x[2])   # Infinitely many local minima with the same x0- and x1-coordinates

# Find a minimum of the function
x0 = np.random.random(3)
result = minimize(f, x0)
print("The found minimum of this function is {}, which should be close to [0,0,0].".format(result.x))
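
It is usually worth checking whether the optimizer actually reports convergence before trusting `result.x`. The sketch below uses a simple, smooth function invented for this illustration (the name `bowl` is hypothetical) whose minimum is known to be at (1, -2).

```python
import numpy as np
from scipy.optimize import minimize

def bowl(x):
    """A smooth two-variable function with its minimum at (1, -2)."""
    return (x[0] - 1.0)**2 + (x[1] + 2.0)**2

result = minimize(bowl, x0=np.zeros(2))

print("Converged:", result.success)
print("Location of minimum:", result.x)
print("Function value at minimum:", result.fun)
```

The `success`, `x`, and `fun` attributes are part of the result object that `minimize` returns; `result.message` gives a text explanation when convergence fails.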

### File Management

Most of the data you will see in medical informatics comes from files you download. More often than not, these files are formatted as CSV files with rows and columns. CSV files are usually opened on a desktop with a program like Excel. However, data can also come in TXT format, which is raw text. How do we use these files in Python?

#### TXT files
While there are other ways to do this, the following methods are the safest and, in my opinion, easiest ways to get data from a TXT file.  
For more instructions on writing to files in addition to reading from files, check out this tutorial: https://www.geeksforgeeks.org/file-handling-python/

##### With Open Readlines

In [None]:
# This method reads in each line in the file into a list of strings, leaving any '\n' newline characters at the end of each string
with open("hr_features.txt", "r") as file:    # the file is closed automatically at the end of the 'with open' block
    file_text = file.readlines()              # the 'file_text' object is still usable outside of the 'with open' block
print(file_text)

##### With Open Read Splitlines

In [None]:
# This method reads in each line in the file into a list of strings, removing any '\n' newline characters at the end of each string
with open("hr_features.txt", "r") as file:    # the "r" argument means that the file can only be read; we cannot edit the file
    file_text = file.read().splitlines()
print(file_text)
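
A third option is to loop over the file object directly, which reads one line at a time and is handy for large files. The sketch below is self-contained: it writes a small throwaway file first, and the file contents are made up for this illustration rather than taken from `hr_features.txt`.

```python
import os
import tempfile

# Create a small temporary file so the example does not depend on other files
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("heart_rate\nresp_rate\n\nspo2\n")
    tmp_path = tmp.name

# Iterate over the file line by line, stripping whitespace and skipping blank lines
with open(tmp_path, "r") as file:
    lines = [line.strip() for line in file if line.strip()]

os.remove(tmp_path)    # clean up the temporary file
print(lines)
```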

#### CSV files

You'll usually want to use Pandas to import a CSV file directly into a DataFrame, but it's useful to know how to read CSV files that aren't intended for use in a DataFrame.  
A much more detailed tutorial on using the csv package in Python can be found here: https://www.geeksforgeeks.org/working-csv-files-python/

In [None]:
# As always, we start with an import statement
import csv

# Because the file is closed at the end of the 'with open' block, we initialize a list to hold the data extracted in the 'with open' block
data = []    # Note that a data structure other than a list may be ideal for different applications

# We read in the file with a csv.reader object; the csv documentation recommends opening the file with newline=""
with open("iris.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        data.append(row)

# Display a sample of the collected data
for i in range(len(data)//10):
    print(data[i])
print("Note that this data is all formatted as strings. You have to convert data gathered this way from CSVs to numeric form separately.")
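
The string-to-number conversion mentioned above can be done as each row is read. Here is a minimal sketch that uses an in-memory file with two made-up rows in a five-column layout (four measurements followed by a label); it assumes every measurement is a valid number.

```python
import csv
import io

# Made-up rows standing in for a real CSV file
fake_file = io.StringIO("5.1,3.5,1.4,0.2,setosa\n7.0,3.2,4.7,1.4,versicolor\n")

rows = []
for row in csv.reader(fake_file):
    *measurements, species = row    # the last field is text, the rest are numbers
    rows.append([float(value) for value in measurements] + [species])

print(rows)
```

With real data you would typically wrap the `float()` call in error handling, since a single malformed field raises a `ValueError`.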

### Pandas

In addition to being adorable bears, Pandas is also an incredibly useful tool for data analysis. Pandas is a Python package that revolves around the idea of DataFrames. A DataFrame is a two-dimensional, labeled table: it resembles a 2D NumPy array, but its rows and columns have names, and each column can hold a different data type. Let's look at the basics:  
For more, see a largely comprehensive guide to the basics here: https://pandas.pydata.org/docs/user_guide/10min.html

#### DataFrame Basics

DataFrames are commonly created from CSV files. Each DataFrame is organized by columns and by indices (equivalent to rows). While columns and indices can be used numerically (i.e., the first index and column are labeled "0", the next index and column are labeled "1", and so forth), it is usually desirable to give the columns or indices human-readable strings as names.
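
DataFrames can also be built directly in code, which makes the roles of columns and the index easy to see. The sketch below uses invented patient data purely for illustration.

```python
import pandas as pd

# Dictionary keys become column names; the `index` argument names the rows
patients = pd.DataFrame(
    {"age": [34, 58, 41], "heart_rate": [72, 88, 65]},
    index=["patient_a", "patient_b", "patient_c"],
)

print(patients)
print("\nHeart rate for patient_b:", patients.loc["patient_b", "heart_rate"])
```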

##### Read CSV

In [None]:
# As always, we start with an import statement
import pandas as pd                          # 'pd' is the standard alias for pandas

# Let's import some data to work with from a CSV file
df = pd.read_csv("iris.csv", header=None)    # 'df' is the standard variable name for a generic DataFrame
                                             # We use "header=None" because there is no row dedicated to just column names in the csv
# Let's look at the first few rows of this DataFrame
df.head()

##### Rename Columns

In [None]:
# We had to pass the 'header=None' argument because this CSV file does not contain explicit headers. Let's rename the columns
df = df.rename(columns={0: "sepal_length", 1: "sepal_width", 2: "petal_length", 3: "petal_width", 4: "species"})
df.head()    # Notice that the data here will be converted to numeric form as appropriate rather than string form

##### Look at Index and Columns

In [None]:
# Every DataFrame has indices and columns. The .index attribute has row names and the .columns attribute has column names.
print("Row names in the DataFrame:")
print(df.index)
print("\nColumn names in the DataFrame:")
print(df.columns)

#### Selecting Columns

We often want to look at only one or some of the columns in a DataFrame at a time. We can display or perform operations with only a few columns at a time by selecting the desired columns.

##### Select from One Column

In [None]:
# Let's select only certain parts of the DataFrame. What are the values of the 'sepal_width' column?
sw_df = df["sepal_width"]        # This is the syntax for selecting an entire column
print("Here is the \"sepal_width\" column of the DataFrame:")
print(sw_df)
print("\nNote that selecting a column from a DataFrame returns a", type(sw_df), "object instead of a DataFrame.")
print("A Series is a one-dimensional analog of a DataFrame. The syntax for a Series is similar, but not identical, to that for a DataFrame.")

# We can select values from this Series in a similar way as with a NumPy array
print("\nValue of \"sepal_width\" at index 2:", sw_df[2])
print("Indices 3 (inclusive) through 6 (exclusive) of \"sepal_width\":")
print(df["sepal_width"][3:6])    # Note that it isn't necessary to create a new object to interact with a single column
print("Indices 146 (inclusive) through the second-to-last index (exclusive) of \"sepal_width\":")
print(df["sepal_width"][146:-2])

##### Select from Two or More Columns

In [None]:
# What if we want to analyze two columns simultaneously?
two_col_df = df[["sepal_width", "petal_width"]]         # This is the syntax for selecting multiple entire columns
print("Here are the \"sepal_width\" and \"petal_width\" columns of the DataFrame:")
print(two_col_df)
print("\nNote that selecting multiple columns from a DataFrame returns a", type(two_col_df), "object, not a Series.")

# We can NOT select values from this DataFrame in a similar way as with a NumPy array or a Series
try:
    print("\nValue of \"sepal_width\" and \"petal_width\" at index 2:", two_col_df[2])
except KeyError:
    print("\nThe input index 2 was assumed to be a column name, not an index. We have to use different syntax.")

#### Selecting Slices

Selecting slices is a shortcut to simultaneously select only certain columns _and_ only certain indices. Slice selection can occur using integers with _.iloc[]_ syntax or by using index and column names with _.loc[]_ syntax.
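
The difference between the two is easiest to see when the index is not just 0, 1, 2, and so on. The following sketch uses a tiny made-up DataFrame with letters as index labels.

```python
import pandas as pd

scores = pd.DataFrame({"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]},
                      index=["a", "b", "c", "d"])

by_position = scores.iloc[1:3]    # integer positions 1 and 2; the end is exclusive
by_label = scores.loc["b":"d"]    # labels "b" through "d"; the end is inclusive

print(by_position)
print(by_label)
```

Note that the position-based slice stops before its end point, while the label-based slice includes it.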

##### Using 'iloc' Syntax

In [None]:
# To select like a NumPy array, we use .iloc[] syntax, which treats indices and columns as integer positions rather than named values

# Here is the syntax for selecting row numbers from a DataFrame with .iloc[]
print("\nValue of \"sepal_width\" and \"petal_width\" at index 2:")
print(two_col_df.iloc[2])
print("Indices 3 (inclusive) through 6 (exclusive) of \"sepal_width\" and \"petal_width\":")
print(df[["sepal_width", "petal_width"]].iloc[3:6])
print("Indices 146 (inclusive) through the second-to-last index (exclusive) of \"sepal_width\" and \"petal_width\":")
print(two_col_df.iloc[146:-2])

# We can select columns by integer position as well, starting from zero
print("\nRow three, column two:", df.iloc[3,2])
print("Rows one to three (exclusive), columns two to four (exclusive):")
print(df.iloc[1:3,2:4])
print("All rows, columns from one to the second-to-last:")
print(df.iloc[:,1:-1])

##### Using 'loc' Syntax

In [None]:
# Indices and columns are normally treated as named values rather than integers

# Here is the syntax for selecting indices, NOT row number, from a DataFrame with .loc[]
print("\nValue of \"sepal_width\" and \"petal_width\" at index 2:")
print(two_col_df.loc[2])
print("Indices 3 (inclusive) through 6 (ALSO INCLUSIVE) of \"sepal_width\" and \"petal_width\":")
print(df[["sepal_width", "petal_width"]].loc[3:6])      # This selection is inclusive at the end because it uses index, not row number
print("Attempting indices 146 through -2 of \"sepal_width\" and \"petal_width\" with .loc[] (this returns an empty DataFrame):")
print(two_col_df.loc[146:-2])                           # .loc[] treats "-2" as an index label, not "second-to-last"; no labels lie between 146 and -2

# To select rows and columns by name, we use .loc[] syntax.
print("\nRow 2, column \"sepal_length\":", df.loc[2,"sepal_length"])
print("Rows 2 through 4 INCLUSIVE, columns \"petal_length\" through \"species\" INCLUSIVE:")
print(df.loc[2:4,"petal_length":"species"])

#### Advanced DataFrame Editing

In addition to changing column names, we can edit DataFrames by changing index names and adding columns manually. It's always advisable to double-check your code before changing index names since in many cases the index itself contains information about a subject of interest.

##### Change Index Values

In [None]:
# To further illustrate the difference between indices and row numbers, note that we can change index values to strings
df.rename(index={0:"a", 1:"b", 2:"c"}, inplace=True)    # Using "inplace=True" means we edit the DataFrame directly, no "df =" needed
df.head()

In [None]:
# We reset the DataFrame here so as to not cause problems with the code later
df = pd.read_csv("iris.csv", header=None)
df = df.rename(columns={0: "sepal_length", 1: "sepal_width", 2: "petal_length", 3: "petal_width", 4: "species"})

##### Add Columns Manually

In [None]:
# Adding a new column is quite simple. The easiest way is to call the new column from the DataFrame
df["extra_column"] = 0    # This fills the whole column with zeros
df.head()

In [None]:
# We can now change the values of the new column to be whatever we wish
for i, idx in enumerate(df.index):
    df.loc[idx,"extra_column"] = i**2
df.head()

In [None]:
# We can copy values from a different column of a DataFrame into our new column too
df["extra_column"] = df["sepal_width"].copy()    # .copy() makes it explicit that the new column holds its own data, independent of "sepal_width"
df.head()

#### Null Value Handling

A very common challenge in data science, especially in medical informatics, is missing data. There are several ways to handle missing values, some more sophisticated than others. Sometimes our best course of action is to simply drop rows or columns with many null or missing values from consideration. Other times it's best to fill the null values with a value that is representative of the population of interest. Many more advanced techniques also exist, all of which fall beyond the scope of this tutorial.
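
Before choosing between dropping and filling, it helps to measure how much data is actually missing. The sketch below is an illustrative addition using a small invented DataFrame; it counts missing values per column and expresses them as fractions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                    "b": [np.nan, np.nan, np.nan, 8.0],
                    "c": [1, 2, 3, 4]})

missing_per_column = toy.isna().sum()    # number of missing values in each column
missing_fraction = toy.isna().mean()     # fraction of missing values in each column

print(missing_per_column)
print(missing_fraction)
```

A column like "b" here, with most of its values missing, is a candidate for dropping, while "a" might be better suited to filling.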

##### Drop Null Values

In [None]:
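# We sometimes need to ignore parts of the data that are missing important information. Let's add missingness to our data for illustration.
missing_rows = np.random.choice(len(df), len(df)//10, replace=False)    # random row positions, chosen without repeats
df.loc[df.index[missing_rows], "extra_column"] = np.nan                 # a single .loc[] call avoids unreliable chained assignment
df["extra_column"].values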

In [None]:
# If there are few enough missing values in our DataFrame, we can just drop the rows that contain missing values
print("Previous number of rows:", len(df))
df_drop_rows = df.dropna(axis="index")    # By not using "inplace=True", we can keep an original copy and an edited copy of the DataFrame
print("New number of rows:", len(df_drop_rows))

In [None]:
# Note that this does NOT cause the index to renumber itself
df_drop_rows.index

In [None]:
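# We can drop an entire column if there are too many missing values for the column to be useful.
missing_rows = np.random.choice(len(df), len(df)//2, replace=False)    # random row positions, chosen without repeats
df.loc[df.index[missing_rows], "extra_column"] = np.nan                # a single .loc[] call avoids unreliable chained assignment
df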

In [None]:
df = df.dropna(axis="columns")    # Note that this will drop ALL columns that have ANY missing values.
df

In [None]:
# In many cases, it's safer to manually drop specific columns than it is to drop all columns with missing data
df.drop(columns="species", inplace=True)    # We drop "species" here only to show the syntax; "inplace=True" means we don't use "df =" syntax
df

In [None]:
# We reset the DataFrame here so as to not cause problems with the code later
df = pd.read_csv("iris.csv", header=None)
df = df.rename(columns={0: "sepal_length", 1: "sepal_width", 2: "petal_length", 3: "petal_width", 4: "species"})
df["extra_column"] = df["sepal_width"].copy()

##### Fill Null Values

In [None]:
# Sometimes we want to replace missing values with an approximation of what they might have been. This is called imputation.
missing_rows = np.random.choice(len(df), len(df)//10, replace=False)    # Adding missingness for illustration purposes
df.loc[df.index[missing_rows], "extra_column"] = np.nan                 # a single .loc[] call avoids unreliable chained assignment
df["extra_column"].values

In [None]:
# A simple method for data imputation is to use the average of the known values in a column
column_mean = df["extra_column"].mean()
print("We will replace null values with the column mean, which is", column_mean)
df["extra_column"] = df["extra_column"].fillna(column_mean)    # Assign the result back; chained "inplace=True" is unreliable in recent pandas
print(df["extra_column"].values)
print("Note that the imputed values have a very different level of precision than the true values.")
print("This may be valid or invalid based on your particular dataset or problem.")
print("For example, a dataset with only integer values might not behave well with this kind of data imputation. Be careful.")

#### More Advanced DataFrame Operations

A very powerful capability of DataFrames is the ability to view or operate on only a specific subset of the data, even if that subset occupies an unknown location within a DataFrame or exists scattered across several different DataFrames. You can select indices based on whatever criteria you wish. The possibilities are endless so long as you know how to translate your ideas into code.

##### Select Where

In [None]:
# We can select parts of a DataFrame based on more advanced criteria
print("All flowers with top decile sepal length:")
top_decile = np.percentile(df["sepal_length"],90)
df.where(df["sepal_length"] >= top_decile).dropna()    # .where() returns the original DataFrame with non-selected rows as null values
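
A common alternative to `.where()` followed by `.dropna()` is boolean indexing, which keeps only the matching rows in one step and leaves the column data types unchanged. This sketch uses a few made-up rows rather than the full iris dataset.

```python
import pandas as pd

flowers = pd.DataFrame({"sepal_length": [5.1, 7.0, 6.3, 4.9],
                        "species": ["setosa", "versicolor", "virginica", "setosa"]})

# A boolean Series inside the brackets keeps only the rows where it is True
long_sepals = flowers[flowers["sepal_length"] >= 6.0]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses
long_non_virginica = flowers[(flowers["sepal_length"] >= 6.0) & (flowers["species"] != "virginica")]

print(long_sepals)
print(long_non_virginica)
```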

##### Join

In [None]:
# We can combine DataFrames with similar data. Let's make some fake data by converting two dictionaries to DataFrames
dict_1 = {"Taste":{"Pizza Planet":5, "A-1 Pizza":7, "Just Ray's Pizza":6}, 
          "Store Size":{"Pizza Planet":7, "A-1 Pizza":2, "Just Ray's Pizza":5}, 
          "Manager":{"Pizza Planet":"Andy", "A-1 Pizza":"Leonardo", "Just Ray's Pizza":"Fred"}
         }
dict_2 = {"Distance":{"Pizza Planet":200, "A-1 Pizza":100, "Pizza Factory":300}}
df_1 = pd.DataFrame(dict_1)
df_2 = pd.DataFrame(dict_2)
print("DataFrame 1:")
print(df_1)
print("\nDataFrame 2:")
print(df_2)

In [None]:
# Whichever DataFrame is on the "left" determines the index for the new DataFrame and the order of the columns
df_1_on_left = df_1.join(df_2)
df_2_on_left = df_2.join(df_1)
print("Join with DataFrame 1 on the left:")
print(df_1_on_left)
print("\nJoin with DataFrame 2 on the left:")
print(df_2_on_left)
print("\nNote that index-column pairs that did not exist before the join are assigned a NaN value.")
print("We also cannot join DataFrames with overlapping column names unless we pass 'lsuffix' and/or 'rsuffix', or select only some of the columns.")

In [None]:
# If we do not want any NaN values resulting from indices that only occur in one DataFrame, we can use an "inner" join
df_inner = df_1.join(df_2, how="inner")
print("Inner join:")
print(df_inner)

In [None]:
# If we want to preserve all indices, we can use an "outer" join
df_outer = df_1.join(df_2, how="outer")
print("Outer join:")
print(df_outer)
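
When two DataFrames share a column name, `join` needs to be told how to tell the columns apart. The sketch below uses two small invented DataFrames that both have a "Taste" column and passes the `lsuffix` and `rsuffix` arguments; the suffix text is an arbitrary choice for this example.

```python
import pandas as pd

left = pd.DataFrame({"Taste": [5, 7]}, index=["Pizza Planet", "A-1 Pizza"])
right = pd.DataFrame({"Taste": [6, 8]}, index=["Pizza Planet", "A-1 Pizza"])

# The suffixes are appended to the overlapping column names from each side
joined = left.join(right, lsuffix="_critic", rsuffix="_customer")
print(joined)
```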

##### Merge

In [None]:
# Merging allows us to combine DataFrames based on columns rather than the index. Let's make some new data for this.
dict_3 = {"Sales":{"Andy":9500, "Leonardo":6700, "Fred":15200}, 
          "Height (cm)":{"Andy":175, "Leonardo":196, "Fred":180}
         }
df_3 = pd.DataFrame(dict_3)
print("DataFrame 3:")
print(df_3)

In [None]:
# Let's merge based on manager name, which is df_3's index but a column in df_1
df_merge = df_1.merge(df_3, left_on="Manager", right_on=df_3.index)
print("Merged:")
print(df_merge)

In [None]:
# Note that the resulting DataFrame has an arbitrary, numeric index. We can use 'set_index' to set "Manager" as the index if we wish.
df_merge_manager = df_merge.set_index("Manager")
print("Merged with \"Manager\" as index:")
print(df_merge_manager)

In [None]:
# If we're confident that row order was preserved we can use the index from df_1 instead
df_merge_original_index = df_merge.rename(index={i:idx for i,idx in enumerate(df_1.index)})
print("Merged with index from DataFrame 1:")
print(df_merge_original_index)
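
When both DataFrames hold the key as a regular column, the `on` argument is the simplest way to merge. The sketch below is an illustrative addition with made-up data; it also passes `indicator=True`, which adds a `_merge` column recording whether each row found a match.

```python
import pandas as pd

stores = pd.DataFrame({"Store": ["Pizza Planet", "A-1 Pizza", "Just Ray's Pizza"],
                       "Manager": ["Andy", "Leonardo", "Fred"]})
sales = pd.DataFrame({"Manager": ["Andy", "Fred", "Zoe"],
                      "Sales": [9500, 15200, 4000]})

# A "left" merge keeps every row of `stores`, matched or not
merged = stores.merge(sales, on="Manager", how="left", indicator=True)
print(merged)
```

Rows marked "left_only" had no matching manager in the second DataFrame, so their "Sales" value is NaN.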

## Data Visualization

We as humans have a far better intuition for visual data than we do for numbers. It can be extremely useful to look at data visually before performing other kinds of data analysis to get an intuition for what kind of analyses would be appropriate. Moreover, in many instances you may be required to present your findings in a visual format, and it's important to make these presentations easy to understand.

In [None]:
# Let's re-import the data just in case
import pandas as pd
df = pd.read_csv("iris.csv", header=None)
df = df.rename(columns={0: "sepal_length", 1: "sepal_width", 2: "petal_length", 3: "petal_width", 4: "species"})

### Pandas

Another strength of Pandas is its ability to quickly produce displays from DataFrames. While Pandas may not have the same visualization versatility as other packages in Python, it is one of the easier packages to use because of its direct connection to the data you already have.  
A useful guide for more Pandas visualization techniques: https://pandas.pydata.org/docs/user_guide/visualization.html

In [None]:
# If we want to make a quick and dirty visualization of a Pandas DataFrame, we can simply use the ".plot()" function
df.plot()

In [None]:
# That really was not a very useful visualization at all. Let's try refining it.
df.plot(kind="hist")

In [None]:
# That's better, but the bars of the histograms all overlap each other. Let's fix that.
df.plot(kind="hist", subplots=True)

In [None]:
# Let's make the plot easier to read by arranging the subplots in a 2x2 grid layout and increasing the figure size
df.plot(kind="hist", subplots=True, layout=(2,2), figsize=(8,8))

In [None]:
# Let's just look at one column of data
df["petal_width"].plot(kind="hist")

In [None]:
# If we're only interested in higher-level data such as quartiles and outliers, we can use a box plot
df.plot(kind="box")

### Matplotlib

The backend for plotting in Pandas uses the Matplotlib library, which is perhaps the most comprehensive data visualization library in all of Python. Matplotlib is extremely useful, and its basics are fairly easy to master, but there is far too much functionality available to cover adequately here. I highly recommend looking through the Matplotlib documentation and tutorials for more information.  
https://matplotlib.org/stable/tutorials/index.html

#### Pyplot

Pyplot is the module within Matplotlib that you will likely use the most. It is the standard interface for Matplotlib.

##### Basic Syntax

In [None]:
# We always start with import statements, and the import statement for Pyplot is a little different
from matplotlib import pyplot as plt    # plt is the standard alias for pyplot in the Matplotlib community
# Let's plot a histogram by calling the ".hist()" function from Pyplot on our data
plt.hist(df["petal_length"])
# It's good form to always add a title and label your axes
plt.title("Iris Dataset: Petal Length")
plt.ylabel("Count")
plt.xlabel("Petal Length (cm)")
# The ".show()" function displays the graph that you have built so far
plt.show()

##### Subplots

In [None]:
# We use subplots to plot multiple graphs next to each other at once

# Here's the graph for petal length
plt.subplot(221)    # This tells Pyplot to put the following graph in the first spot of a 2x2 grid
plt.hist(df["petal_length"])
# It's good form to always add a title and label your axes
plt.title("Petal Length")
plt.ylabel("Count")
plt.xlabel("Petal Length (cm)")

# Here's the graph for petal width
plt.subplot(222)    # This tells Pyplot to put the following graph in the second spot of a 2x2 grid
plt.hist(df["petal_width"])
plt.title("Petal Width")
plt.ylabel("Count")
plt.xlabel("Petal Width (cm)")

# Here's the graph for sepal width
plt.subplot(2,2,3)    # This alternate syntax tells Pyplot to put the following graph in the third spot of a 2x2 grid
plt.hist(df["sepal_width"])
plt.title("Sepal Width")
plt.ylabel("Count")
plt.xlabel("Sepal Width (cm)")

# Here's the graph for sepal length
plt.subplot(2,2,4)    # This alternate syntax tells Pyplot to put the following graph in the fourth spot of a 2x2 grid
plt.hist(df["sepal_length"])
plt.title("Sepal Length")
plt.ylabel("Count")
plt.xlabel("Sepal Length (cm)")

# We can add a title above the whole figure in addition to the titles in each subplot
plt.suptitle("Iris Dataset")

# Display the graphs
plt.tight_layout()    # Using plt.tight_layout() is important to prevent axis labels and titles from overlapping each other
plt.show()
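
The same kind of grid can be written more compactly with `plt.subplots()`, which returns a figure and an array of axes objects that can be looped over. The sketch below uses synthetic random data with invented column names so that it runs without the iris file; in a script you would finish with `plt.show()` as before.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data (not the iris measurements)
rng = np.random.default_rng(0)
fake_columns = {"Column A": rng.normal(4.0, 1.0, 150),
                "Column B": rng.normal(1.0, 0.5, 150),
                "Column C": rng.normal(3.0, 0.4, 150),
                "Column D": rng.normal(6.0, 0.8, 150)}

# Create a 2x2 grid of axes in one call
fig, axes = plt.subplots(2, 2, figsize=(8, 8))

# axes.flat walks through the grid row by row
for ax, (name, values) in zip(axes.flat, fake_columns.items()):
    ax.hist(values)
    ax.set_title(name)
    ax.set_xlabel(name + " (cm)")
    ax.set_ylabel("Count")

fig.suptitle("Synthetic Data")
fig.tight_layout()
```

This style avoids repeating the title and label code once per subplot.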

##### Fine-Tuning

In [None]:
# Note that the edges of the bars of our histograms are spaced at weird intervals. We can use the "bins" argument to change that.
plt.hist(df["petal_length"], bins=np.linspace(0,8,9))    # np.linspace() creates an evenly spaced array; 9 numbers between 0 and 8 (incl)
# As always, we add a title and axis labels
plt.title("Iris Dataset: Petal Length")
plt.ylabel("Count")
plt.xlabel("Petal Length (cm)")
# Display the new plot
plt.show()

In [None]:
# If we would like finer granularity, we can add more bins by making a longer array with np.linspace()
plt.hist(df["petal_length"], bins=np.linspace(0,8,81))
# As always, we add a title and axis labels
plt.title("Iris Dataset: Petal Length")
plt.ylabel("Count")
plt.xlabel("Petal Length (cm)")
# Display the new plot
plt.show()

In [None]:
# That granularity might be too fine. Let's try that again
plt.hist(df["petal_length"], bins=np.linspace(0,8,41))
# As always, we add a title and axis labels
plt.title("Iris Dataset: Petal Length")
plt.ylabel("Count")
plt.xlabel("Petal Length (cm)")
# Display the new plot
plt.show()

In [None]:
# If we would like bar heights that are comparable across sample sizes, we can use the "density" argument
plt.hist(df["petal_length"], bins=np.linspace(0,8,41), density=True)
# Note that "density=True" rescales the bars so that their total AREA is 1; the heights are probability densities, not proportions,
# and they can exceed 1 when bins are narrower than 1 unit. Check that the tallest bar fits before fixing the y-axis limits like this.
plt.ylim(0,1)
# As always, we add a title and axis labels
plt.title("Iris Dataset: Petal Length")
plt.ylabel("Probability Density")
plt.xlabel("Petal Length (cm)")
# Display the new plot
plt.show()
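
To see how density differs from proportion, the sketch below computes both with `np.histogram()` on a handful of made-up values. The `weights` argument used here is also accepted by `plt.hist()`, so the same approach works for plotting true proportions.

```python
import numpy as np

values = np.array([1.05, 1.10, 1.15, 1.50, 4.50, 4.70, 5.10, 5.90])
bins = np.linspace(0, 8, 41)    # 40 bins, each 0.2 wide

# density=True: bar heights are scaled so that the total area equals 1
density, _ = np.histogram(values, bins=bins, density=True)

# Equal weights of 1/N: bar heights are proportions that sum to 1
weights = np.ones(len(values)) / len(values)
proportion, _ = np.histogram(values, bins=bins, weights=weights)

print("Tallest density bar:", density.max())
print("Total area under density bars:", (density * np.diff(bins)).sum())
print("Tallest proportion bar:", proportion.max())
print("Sum of proportions:", proportion.sum())
```

Because the bins are narrower than 1 unit, the tallest density bar here is greater than 1 even though no proportion can be.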

In [None]:
# We can change the display color with the "color" argument, and we can add labels for a legend with the "label" argument
plt.hist(df["petal_length"], bins=np.linspace(0,8,41), density=True, color="g", label="Petal Length")
# These are probability densities (total bar area of 1), so make sure the tallest bar fits before fixing the y-axis limits
plt.ylim(0,1)
# We can add a legend if we wish
plt.legend(loc="upper left")
# As always, we add a title and axis labels
plt.title("Iris Dataset: Petal Length")
plt.ylabel("Probability Density")
plt.xlabel("Petal Length (cm)")
# Display the new plot
plt.show()

#### Mplot 3D

Plotting in 3D can be very useful, although it is somewhat more complicated. Instead of using basic Pyplot syntax, we must use more advanced syntax.

In [None]:
# We first create a figure object and an axes object with a 3D projection
fig = plt.figure(figsize=(9,9))    # We set the figure size here if we don't want to use the default size
ax = fig.add_subplot(projection='3d')

# We use the axes object to add a title and labels
ax.set_title("Iris Dataset")
ax.set_xlabel("Petal Length")
ax.set_ylabel("Petal Width")
ax.set_zlabel("Sepal Width")

# We can plot each entry in the iris dataset as a point in 3D space using a 3D scatterplot
ax.scatter(df["petal_length"], df["petal_width"], df["sepal_width"])

# We can set the viewing angle (elevation and azimuth) before displaying the data
ax.view_init(20, 25)

# We display the graph as usual
plt.show()

In [None]:
# If we want our plot to be interactive in Jupyter Notebook, we use the following command (it requires the ipympl package to be installed):
%matplotlib widget
# Note that this will only apply to graphs generated after we run this command

In [None]:
# This is the same code as above
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(projection='3d')

# We use the axes object to add a title and labels
ax.set_title("Iris Dataset")
ax.set_xlabel("Petal Length")
ax.set_ylabel("Petal Width")
ax.set_zlabel("Sepal Width")

# We can plot each entry in the iris dataset as a point in 3D space using a 3D scatterplot
ax.scatter(df["petal_length"], df["petal_width"], df["sepal_width"])

# We can set the viewing angle (elevation and azimuth) before displaying the data
ax.view_init(20, 25)

# We display the graph as usual
plt.show()

In [None]:
# To revert to non-interactive graphs, we use the following command:
%matplotlib inline

### Seaborn

Seaborn is a Python package built on top of Matplotlib. It focuses on displaying statistical data, whereas Matplotlib is designed to be more general-purpose. For statistical plots, Seaborn usually produces attractive results with less code than Matplotlib alone.  
Seaborn's official tutorials: https://seaborn.pydata.org/tutorial.html  
A perhaps more user-friendly tutorial: https://www.geeksforgeeks.org/python-seaborn-tutorial/

In [None]:
# As per usual, here is our import statement with the community-standard alias
import seaborn as sns

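# Let's get the correlation between each of the four variables in the iris dataset
# We select the numeric columns first, since recent versions of pandas raise an error if a text column (such as a species label) is included
corr = df.select_dtypes(include="number").corr()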
print("Correlation matrix for each variable in the iris dataset:")
print(corr, end="\n\n")

# Let's visualize this data with a heatmap
ax = sns.heatmap(corr)
plt.show()

In [None]:
# It may be easier to understand the data with a different colormap; more colormaps at https://seaborn.pydata.org/tutorial/color_palettes.html
ax = sns.heatmap(corr, cmap="viridis")
plt.show()

In [None]:
# We can annotate the cells in the heatmap to have visual and numeric data simultaneously
ax = sns.heatmap(corr, cmap="magma", annot=True)
plt.show()

## Regular Expressions

A regular expression, also known as a _regex_ or an _re_, is a powerful tool for handling textual data. In short, a regular expression acts as an advanced search query to find strings that match the query. You can practice with regular expressions at https://regex101.com.  
This is a good example tutorial for learning regular expressions: https://towardsdatascience.com/a-very-easy-tutorial-to-learn-python-regular-expression-re-c42fbbc01ef2

### Basic Python Usage

The basic syntax in Python for regular expressions isn't too complicated. We can save query patterns for later by compiling pattern strings into a Pattern object, then we can pass strings to be searched for the pattern into functions of the Pattern object to determine if a match exists. For most of these functions, if there is no match, None is returned (which always evaluates to False in Python), while a Match object (which always evaluates to True in Python) is returned if there is a match. Each Match object contains fairly detailed information about the match between the pattern string and the string that was searched. Other functions return lists of variable length containing strings that match the pattern.

In [None]:
# Here's our import statement
import re

# We compile our query string into a Python regex object so we can use it more easily later
cat_finder = re.compile("cat")

# Here are test strings to search
test_str_1 = "I like cats"    # The query is in this string
test_str_2 = "I like dogs"    # The query is not in this string

# Calling ".search()" on our regex object returns a match object if any part of the string matches the regular expression, None otherwise
search_1 = cat_finder.search(test_str_1)
search_2 = cat_finder.search(test_str_2)
print("Does \"{}\" contain {}? {}".format(test_str_1,cat_finder,bool(search_1)))
print("Does \"{}\" contain {}? {}\n".format(test_str_2,cat_finder,bool(search_2)))

# Calling ".findall()" on our regex object returns a list of matches
findall_1 = cat_finder.findall(test_str_1)
findall_2 = cat_finder.findall(test_str_2)
print("How often does \"{}\" contain {}? {}".format(test_str_1,cat_finder,len(findall_1)))
print("How often does \"{}\" contain {}? {}\n".format(test_str_2,cat_finder,len(findall_2)))

# Calling ".match()" on our regex object returns a match object if the beginning of the string matches the regular expression, None otherwise
match_1 = cat_finder.match(test_str_1)
match_2 = cat_finder.match(test_str_2)
print("Does \"{}\" match {}? {}".format(test_str_1,cat_finder,bool(match_1)))
print("Does \"{}\" match {}? {}\n".format(test_str_2,cat_finder,bool(match_2)))
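
As a small sketch of the "fairly detailed information" a Match object carries, the cell below inspects the matched text and its position, and contrasts ".match()" with ".fullmatch()" and ".finditer()". The sample sentence is made up for illustration.

```python
import re

cat_finder = re.compile("cat")
text = "I like cats, and my cat likes me"    # made-up sample sentence

# .search() stops at the first match and returns a Match object
first = cat_finder.search(text)
print("Matched text:", first.group())
print("Start index (inclusive):", first.start())
print("End index (exclusive):", first.end())
print("Span:", first.span())

# .match() only checks the beginning of the string, so it finds nothing here
print("match():", cat_finder.match(text))

# .fullmatch() requires the entire string to match the pattern
print("fullmatch() on 'cat':", bool(cat_finder.fullmatch("cat")))
print("fullmatch() on 'cats':", bool(cat_finder.fullmatch("cats")))

# .finditer() yields one Match object per non-overlapping match
positions = [m.start() for m in cat_finder.finditer(text)]
print("All start positions:", positions)
```

Using ".finditer()" instead of ".findall()" is worthwhile whenever the location of each match matters, since ".findall()" returns only the matched strings.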

### Special Characters

Special characters are what separate regular expressions from a simple Ctrl+F or Cmd+F search. They allow us to specify allowed variants, delimit groups, and set up simple logical expressions.

#### Character Sets

Using "\[ \]" indicates that any one character, but only one character, within the square brackets (the "set") can count as a match.

In [None]:
# This regex will match both "gray" and "grey", which is useful if we don't know whether a document uses American English or British English
grey_gray = re.compile("gr[ae]y")

# Let's search each of these test strings
test_strings = ["gray", "grey", "groy", "graey"]
for string in test_strings:
    if grey_gray.match(string):
        print(string, "matches", str(grey_gray))
    else:
        print(string, "does not match", str(grey_gray))

In [None]:
# This regex will match a range of characters alphabetically between 'n' and 'z', the last half of the alphabet. It is case sensitive
last_half = re.compile("[n-z]")    # The '-' in this context means to look at a range of characters

# Let's search each of these test strings
test_strings = ["a", "e", "i", "o", "u", "y", "A", "E", "I", "O", "U", "Y"]
for string in test_strings:
    if last_half.match(string):
        print(string, "matches", str(last_half))
    else:
        print(string, "does not match", str(last_half))

In [None]:
# This regex will match a range of characters alphabetically between 'n' and 'z', the last half of the alphabet. It is case insensitive
last_half = re.compile("[n-zN-Z]")    # The '-' in this context means to look at a range of characters

# Let's search each of these test strings
test_strings = ["a", "e", "i", "o", "u", "y", "A", "E", "I", "O", "U", "Y"]
for string in test_strings:
    if last_half.match(string):
        print(string, "matches", str(last_half))
    else:
        print(string, "does not match", str(last_half))

In [None]:
# This regex will match any character that is NOT alphabetically between 'n' and 'z', the last half of the alphabet. It is case insensitive
last_half = re.compile("[^n-zN-Z]")    # The '^' in this context is a negator character indicating to use the complement of the set

# Let's search each of these test strings
test_strings = ["a", "e", "i", "o", "u", "y", "A", "E", "I", "O", "U", "Y"]
for string in test_strings:
    if last_half.match(string):
        print(string, "matches", str(last_half))
    else:
        print(string, "does not match", str(last_half))
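
The '-' and '^' characters are only special in certain positions inside a set. The following sketch, with patterns invented for illustration, shows where each one is treated literally and how several ranges can share one set.

```python
import re

# "^" negates only when it is the first character in the set; elsewhere it is a literal caret
caret_literal = re.compile("[a^b]")
# "-" is literal when it is the first or last character in the set, since no range can be formed
dash_literal = re.compile("[-az]")
# Several ranges can be combined in a single set
hex_digit = re.compile("[0-9a-fA-F]")

checks = [(caret_literal, ["a", "^", "c"]),
          (dash_literal, ["-", "m", "z"]),
          (hex_digit, ["7", "c", "G"])]

for pattern, candidates in checks:
    for candidate in candidates:
        verdict = "matches" if pattern.match(candidate) else "does not match"
        print(repr(candidate), verdict, pattern.pattern)
```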

#### Built-In Sets of Characters

Using a "\\" followed by a certain letter will cause the regular expression to match a predefined, built-in set of characters. Here is a list of some of the more useful ones. The full list can be found at https://docs.python.org/3/library/re.html#regular-expression-syntax  
- <b>\d:</b> matches any digit; for ASCII characters only, this is equivalent to \[0-9\]  
- <b>\D:</b> matches anything that is _not_ a digit; for ASCII characters only, this is equivalent to \[^0-9\]  
- <b>\s:</b> matches any whitespace character
- <b>\S:</b> matches any character that is _not_ whitespace
- <b>\w:</b> matches any "word" character; for ASCII characters only, this is equivalent to \[a-zA-Z0-9_\]
- <b>\W:</b> matches any character that is not a "word" character; it is the complement of \w
- <b>. :</b> matches literally any character except a newline; to find an actual period only, use <b>"\\."</b>

In [None]:
# A toy example of character matching
toy_regex = re.compile(r"\w\s\S\w\W\d\d\D")    # The r prefix makes this a raw string, so the backslashes reach the regex engine unchanged

# Let's search each of these test strings
test_strings = ["I am 35!", "Are you 35?", "I  am025", "O my 37!"]
for string in test_strings:
    if toy_regex.match(string):
        print("\"{}\" matches {}".format(string,toy_regex))
    else:
        print("\"{}\" does not match {}".format(string,str(toy_regex)))
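
Patterns that contain backslashes are usually written as raw strings (with an r prefix) so that Python does not try to interpret the backslashes itself. Here is a short sketch of that idea, along with the difference between "." and "\\."; the sample text is made up.

```python
import re

# In a normal string "\n" is one newline character; in a raw string it is a backslash followed by "n"
print(len("\n"), len(r"\n"))

sample = "Room 12 costs 3.5 or 3x5 units"    # made-up sample text

digits = re.findall(r"\d", sample)        # every individual digit
loose = re.findall(r"\d.\d", sample)      # "." accepts any character between the two digits
strict = re.findall(r"\d\.\d", sample)    # "\." accepts only a literal period
spaces = re.findall(r"\s", sample)        # every whitespace character

print("Digits:", digits)
print("Digit, any character, digit:", loose)
print("Digit, period, digit:", strict)
print("Number of whitespace characters:", len(spaces))
```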

#### Counting Characters

Any of the following characters can be used to specify the number of times to match a certain expression.
- <b>* :</b> matches any number of repetitions of the preceding expression, including zero
- <b>+ :</b> matches one or more repetitions of the preceding expression
- <b>? :</b> matches zero or one occurrence of the preceding expression

Using parentheses groups characters together so that the counting characters apply to the whole group.

In [None]:
# Another toy example of character matching
toy_regex_2 = re.compile(r"The data are( not)? reliable:\s*\d+ bugs? found")

# Let's search each of these test strings
test_strings = ["The data are reliable: 0 bugs found\t", 
                "The data are not reliable: 1 bug found", 
                "The data are not reliable:3 bugs found", 
                "The data are not reliable:   18 bugs found", 
                "The data are not reliable:\t18 bugs found", 
                "The data are not not reliable: 0 bugs found", 
                "The data are reliable: no bugs found\t", 
               ]
for string in test_strings:
    if toy_regex_2.match(string):
        print("\"{}\" \tDOES match the regex".format(string))
    else:
        print("\"{}\" \tdoes NOT match the regex".format(string))
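
To isolate what each counting character does, here is a minimal sketch using short invented patterns. It uses ".fullmatch()", which requires the whole string to match, so that extra trailing characters count as a failure.

```python
import re

# Each entry pairs a pattern with candidate strings to test against it
examples = [
    ("ab*", ["a", "ab", "abbb", "b"]),               # "b" may appear zero or more times
    ("ab+", ["a", "ab", "abbb"]),                    # "b" must appear at least once
    ("colou?r", ["color", "colour", "colouur"]),     # "u" may appear zero times or once
    ("ha+!", ["ha!", "haa!", "haha!"]),              # "+" applies only to the "a"
    ("(ha)+!", ["ha!", "haa!", "haha!"]),            # "+" applies to the whole group "ha"
]

results = {}
for pattern, candidates in examples:
    regex = re.compile(pattern)
    for candidate in candidates:
        results[(pattern, candidate)] = bool(regex.fullmatch(candidate))
        print("{!r} fully matches {!r}: {}".format(candidate, pattern, results[(pattern, candidate)]))
```

Comparing the last two patterns shows why the parentheses matter: without them the counting character applies only to the single character before it.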

#### Using 'Or'

The vertical pipe "|" matches if the expression on either side of it matches. Several alternatives can be chained together, and parentheses limit how far each alternative extends.

In [None]:
# Another toy example of character matching
toy_regex_3 = re.compile("(Richard Nixon|George Washington|Millard Fillmore) was a president of the US")

# Let's search each of these test strings
test_strings = ["Richard Nixon was a president of the US", 
                "Nixon was a president of the US\t", 
                "George Washington was a president of the US", 
                "Millard was a president of the US\t", 
                "Millard Fillmore was a president of the US", 
                "Oprah Winfrey was a president of the US"
               ]
for string in test_strings:
    if toy_regex_3.match(string):
        print("\"{}\" \tDOES match the regex".format(string))
    else:
        print("\"{}\" \tdoes NOT match the regex".format(string))
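
Without parentheses, each side of the "|" extends as far as it can, which is easy to overlook. The sketch below, using made-up patterns, compares an ungrouped and a grouped alternation and records what each one actually matched.

```python
import re

ungrouped = re.compile("gray|grey cat")      # reads as: "gray" OR "grey cat"
grouped = re.compile("(gray|grey) cat")      # reads as: ("gray" OR "grey") followed by " cat"

matched_text = {}
for text in ["gray cat", "grey cat", "gray dog"]:
    u = ungrouped.match(text)
    g = grouped.match(text)
    # Store the matched portion of the string, or None if there was no match
    matched_text[text] = (u.group() if u else None, g.group() if g else None)
    print("{!r}: ungrouped matched {!r}, grouped matched {!r}".format(text, *matched_text[text]))
```

The ungrouped pattern accepts "gray dog" because the alternative "gray" is complete on its own, while the grouped pattern requires " cat" after either spelling.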

### Finding Groups

By delimiting groups of characters in parentheses, we can have Python return specific pieces of a matching string. The match object returned by ".match" or ".search" has a "group" method that returns these groups by number: group(0) is the entire match, and group(1), group(2), and so on are the parenthesized groups in order.

In [None]:
# Let's create a DataFrame of medical information from a set of free-text strings
medical_text = ["The patient, Mr. Juanes, is 180 lbs. and 30 years old.",
                "This patient, María, weighs 240 pounds and is 29 years old.", 
                "Today's patient, Ramón, weighs in at 172 pounds and turned 18 years old recently.", 
                "The newborn patient, Steve, only weighs 1 pound. He is 0 years old today."
               ]

# We initialize a dictionary to hold the information to be turned into a DataFrame
med_info_dict = {}

# We create a regex to find the name, weight, and age of each patient
patient_info_finder = re.compile(r".+patient, ([\w\. ]+),\D+(\d+) (lbs?\.|pounds?)\D+(\d+) years old")    # Each () here is a returnable group
# ".+" indicates that an arbitrary, non-zero number of characters precedes the word "patient"
# "patient, " is to ensure that the patient's name is in the immediately following section of text
# "([\w\. ]+)," captures any non-zero number of letters, periods, and spaces preceding a comma, which should be the patient's name
# "\D+(\d+)" captures the first set of digits following a string of non-numeric characters
# " (lbs?\.|pounds?)" is to ensure that the string of digits corresponds to a weight in pounds; the "s?"s are for plurals and singulars
# Since " (lbs?\.|pounds?)" is technically a capturing group, but we don't wish to capture this group, we skip .group(3) in the code below
# "\D+(\d+)" captures the next set of digits following a string of non-numeric characters
# " years old" ensures that the previous string of digits corresponds to an age in years

# We fill the dictionary with information from the regular expressions
for text in medical_text:
    match = patient_info_finder.search(text)    # We search once and reuse the resulting match object
    if match:    # This if-statement is necessary to avoid calling ".group" on a None object
        name = match.group(1)
        weight = match.group(2)
        age = match.group(4)
        med_info_dict[name] = (weight, age)

# We convert the dictionary into a DataFrame
medical_df = pd.DataFrame(columns=["Age (years)", "Weight (lbs)"])
for name in med_info_dict.keys():
    medical_df.loc[name,"Age (years)"] = med_info_dict[name][1]
    medical_df.loc[name,"Weight (lbs)"] = med_info_dict[name][0]
medical_df
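
As an alternative to skipping an unwanted group by number, Python's re module also supports non-capturing groups, written "(?:...)", and named groups, written "(?P<name>...)". The following is a sketch of how the pattern above might be rewritten with them, not a replacement for the approach shown; the group names "name", "weight", and "age" are chosen here for illustration.

```python
import re

# Two of the sample sentences from the cell above
texts = ["The patient, Mr. Juanes, is 180 lbs. and 30 years old.",
         "This patient, María, weighs 240 pounds and is 29 years old."]

# (?P<label>...) names a capturing group; (?:...) groups without capturing
named_finder = re.compile(
    r".+patient, (?P<name>[\w. ]+),\D+(?P<weight>\d+) (?:lbs?\.|pounds?)\D+(?P<age>\d+) years old"
)

records = []
for text in texts:
    match = named_finder.search(text)
    if match:
        # .groups() now holds only the three captured values, in order
        print(match.groups())
        # .groupdict() maps each group name to its captured string
        record = match.groupdict()
        # Captured values are strings, so we convert the numeric ones explicitly
        record["weight"] = int(record["weight"])
        record["age"] = int(record["age"])
        records.append(record)

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame` if a table is wanted.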

## References

https://docs.python.org/3/library/re.html  
https://pandas.pydata.org/  
https://numpy.org/  
https://scipy.org/  
https://archive.ics.uci.edu/ml/datasets/iris

Made by Adam Kotter, copyright 2022