# Python for Data Analysis and Visualization

---
<html>
<p>
<a href="https://tuftsdatalab.github.io/python-data-analysis/" target="_blank">
        <img src="https://tuftsdatalab.github.io/badges/workshop.svg" alt="Workshop Website" style="float: left;"/></a>
<span style="float:left;">&ensp;</span>
<a href="https://github.com/tuftsdatalab/python-data-analysis/" target="_blank">
        <img src="https://tuftsdatalab.github.io/badges/github.svg" alt="View on GitHub" style="float: left;"/></a>
<span style="float:left;">&ensp;</span>
<a href="https://sites.tufts.edu/datalab/" target="_blank">
        <img src="https://tuftsdatalab.github.io/badges/datalab.svg" alt="datalab.tufts.edu" style="float: left;"/></a>
<span style="float:left;">&ensp;</span>
<a href="https://twitter.com/intent/follow?screen_name=tuftsdatalab" target="_blank">
        <img src="https://tuftsdatalab.github.io/badges/twitter.svg" alt="@TuftsDataLab" style="float: left;"/></a>
<br>
</p>
</html>

**A Tufts University Data Lab Workshop**\
Written by Uku-Kaspar Uustalu

Python resources: [go.tufts.edu/python](https://sites.tufts.edu/datalab/python/)\
Questions: [datalab-support@elist.tufts.edu](mailto:datalab-support@elist.tufts.edu)\
Feedback: [uku-kaspar.uustalu@tufts.edu](mailto:uku-kaspar.uustalu@tufts.edu)

---

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---
## Working with Messy Data

In [None]:
grades = pd.read_csv('data/grades.csv')

In [None]:
grades

In [None]:
print(grades)

### Cleaning Column Names

In [None]:
grades.rename(str.lower, axis = 'columns')

In [None]:
grades

In [None]:
grades = grades.rename(str.lower, axis = 'columns')

In [None]:
grades

In [None]:
grades.rename(columns = {'exam 1': 'exam1', 'exam_3': 'exam3'}, inplace = True)

In [None]:
grades

### Indexing and Datatypes

In [None]:
grades.dtypes

In [None]:
grades['name']

In [None]:
grades.name

In [None]:
grades[['name']]

In [None]:
grades['name'][0]

In [None]:
grades['name'][1]

In [None]:
grades.name[1]

In [None]:
grades['exam1'][1]

In [None]:
grades.exam2[1]

In [None]:
grades['exam3'][1]

In [None]:
grades.exam4[1]

In [None]:
grades

In [None]:
print(type(grades['exam3'][0]))
print(type(grades['exam3'][1]))
print(type(grades['exam3'][2]))

### Assigning Values and Working with Missing Data

In [None]:
grades['exam3'][2] = 0

**Oh no, a really scary warning!** What is happening?

Because Python uses something called *pass-by-object-reference* and does a lot of optimization in the background, the end user (that is, you) has little to no control over whether they are referencing the **original** object or a **copy**. This **warning** is just pandas letting us know that when *chained indexing* is used to write a value, the behaviour is ***undefined***, meaning that **pandas** cannot be sure whether you are writing to the **original** data frame or a temporary **copy**.

To learn more: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
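
As a minimal sketch of the failure case the warning is about, the example below uses a small made-up data frame (the `scores` name and its values are hypothetical, not part of the workshop data). Boolean indexing hands back a new object, so a chained write to it never reaches the original.

```python
import warnings

import pandas as pd

# Hypothetical toy data, not the workshop's grades.csv
scores = pd.DataFrame({'name': ['Ann', 'Bob', 'Cat'], 'exam': [90, 55, 70]})

with warnings.catch_warnings():
    # Silence the chained-assignment warning for this demonstration only
    warnings.simplefilter('ignore')
    # scores[...] returns a temporary copy, so the write lands on that copy
    scores[scores['exam'] < 60]['exam'] = 60

# The original data frame is unchanged: the 55 is still there
print(scores['exam'].tolist())
```

Here the write is silently lost, which is exactly why chained indexing should not be used for assignment.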

In [None]:
grades

Phew, this time we got lucky. However, with a different data frame the same approach might actually write the changes to a *temporary copy* and leave the original data frame unchanged. Chained indexing is dangerous and you should avoid using it to **write** values. What should we use instead?

There are a **lot** of options: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

- To write a single value, use `.at[row, column]`
- To write a range of values (or a single value), use `.loc[row(s), column(s)]`

*Note that `.at` and `.loc` use row and column labels. The numbers 0, 1, 2, 3, and 4 that we see in front of the rows are actually row labels. By default, row labels match row indexes in pandas. However, quite often you will work with rows that have actual labels. Sometimes those labels might be numeric and resemble indexes, which leads to confusion and error. Hence, if you want to make sure you are using _indexes_, not labels, use `.iat` and `.iloc` instead.*
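
The difference between labels and positions is easiest to see when the two disagree. The sketch below assumes a small made-up data frame (`scores` and its values are hypothetical) whose numeric row labels are deliberately out of order.

```python
import pandas as pd

# Hypothetical toy data with numeric row labels that do NOT match row positions
scores = pd.DataFrame({'exam': [90, 55, 70]}, index=[2, 0, 1])

# .at uses labels: the row labelled 0 is the second row
by_label = scores.at[0, 'exam']
# .iat uses positions: row 0, column 0 is the first row
by_position = scores.iat[0, 0]
print(by_label, by_position)

# .loc writes by label: the row labelled 1 is the last row
scores.loc[1, 'exam'] = 75
# .iloc writes by position: position 1 is the second row (labelled 0)
scores.iloc[1, 0] = 60
print(scores)
```

With the default labels 0, 1, 2, ... both approaches hit the same rows, which is why the distinction is easy to miss until a data frame has been sorted, filtered, or given a custom index.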

In [None]:
grades.at[2, 'exam3']

In [None]:
grades.loc[3, 'exam2'] = np.nan

In [None]:
grades

In [None]:
grades.dtypes

In [None]:
grades['exam2'][0]

In [None]:
grades.exam3[0]

In [None]:
grades['exam2'] = pd.to_numeric(grades['exam2'])
grades['exam3'] = pd.to_numeric(grades.exam3)

In [None]:
grades

In [None]:
grades.dtypes

### Aggregating Data

In [None]:
grades['sum'] = grades['exam1'] + grades['exam2'] + grades['exam3'] + grades['exam4']

In [None]:
grades

In [None]:
grades.drop('sum', axis = 'columns', inplace = True)

In [None]:
grades

In [None]:
grades['sum'] = grades.sum(axis = 'columns', numeric_only = True)

In [None]:
grades

---
## A Better Way of Working with Messy Data

In [None]:
del grades

In [None]:
grades = pd.read_csv('data/grades.csv', na_values = 'excused')
grades

In [None]:
grades.rename(str.lower, axis = 'columns', inplace = True)
grades.rename(columns = {'exam 1': 'exam1', 'exam_3': 'exam3'}, inplace = True)
grades

In [None]:
grades.dtypes

In [None]:
grades['exam3'] = pd.to_numeric(grades['exam3'], errors = 'coerce')
grades

In [None]:
grades.dtypes

In [None]:
grades['exam3'] = grades['exam3'].fillna(0)
grades

In [None]:
grades['mean'] = grades.mean(axis = 'columns', numeric_only = True)
grades

In [None]:
grades.loc[:, 'max'] = grades.max(axis = 'columns', numeric_only = True)
grades.loc['mean'] = grades.mean(axis = 'rows', numeric_only = True)
grades.loc['max', :] = grades.max(axis = 'rows', numeric_only = True)
grades

---
## Working with Real Data

In [None]:
avocados = pd.read_csv('data/avocado.csv')

In [None]:
avocados

In [None]:
avocados.head()

In [None]:
avocados.shape

In [None]:
avocados.dtypes

### Subsetting Data using Boolean Indexing

In [None]:
avocados.geography

In [None]:
avocados.geography == 'Boston'

In [None]:
avocados[avocados.geography == 'Boston']

In [None]:
avocados_boston = avocados[avocados.geography == 'Boston']

In [None]:
avocados_boston.head(10)

In [None]:
avocados_boston_copy = avocados[avocados.geography == 'Boston'].copy()

In [None]:
avocados_boston_copy.head(10)

In [None]:
np.mean(avocados_boston.average_price[avocados_boston.year == 2019])

In [None]:
mean_2019 = np.mean(avocados_boston.average_price[avocados_boston.year == 2019])

In [None]:
print("The average price for avocados in the Boston area in the year 2019 was: $", round(mean_2019, 2))

### Creating Plots

In [None]:
plt.plot(avocados_boston.date, avocados_boston.average_price)
plt.show()

In [None]:
plt.figure(figsize = (20, 8))
plt.plot(avocados_boston.date, avocados_boston.average_price, color = 'green', linestyle = '--', marker = 'o')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()

In [None]:
avocados_boston.plot(x = 'date', y = 'average_price', figsize = (18, 8), kind='line', color = 'green')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()

In [None]:
avocados_boston[avocados_boston.year == 2019].plot(x = 'date', y = 'average_price', figsize = (18, 8), kind='line', color = 'green')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()

In [None]:
plt.hist(avocados.average_price)
plt.xlabel('Price')
plt.show()

In [None]:
sns.histplot(avocados.average_price, color = 'r', kde = True)

In [None]:
sns.histplot(avocados.average_price[avocados.year == 2019], color = 'r', kde = True)

In [None]:
sns.histplot(avocados.average_price[avocados.geography == 'Boston'], color = 'r', kde = True)