# Day 1 Lab: Intro to Pandas

Outline

- Introduction to Pandas
- Working with data in Pandas
- Visualizing data in Pandas








# Notebook Instructions

1. Save a copy of the forked notebook to Google Drive (File >> Save a copy in Drive).  This is the only way you'll be able to save your changes.
2. Make sure to retitle the copy.  The title should include the **lab number** and **your name**.
3. Do the Python coding as directed below (look for the "Your Turn" sections).
4. When you have completed your coding, save your notebook in Drive (File >> Save) and also save it to your GitHub repo (File >> Save a copy in GitHub).
5. Navigate to your notebook file in GitHub and copy and submit the URL for the Canvas lab assignment.

# Introduction to Pandas

Pandas is a powerful data manipulation library in Python. It provides data structures and functions that make working with structured data more convenient and efficient than in base Python.

Pandas introduces two primary data structures: **Series** and **DataFrames**.
- A Series is a one-dimensional collection of values (the elements must all be the same data type).
- A DataFrame is a two-dimensional data structure with columns of potentially different types. It is similar to a spreadsheet.
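
As a minimal sketch of the two structures, the snippet below builds a Series and a DataFrame by hand. The values and the names `mpg_values` and `cars` are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# A Series: one-dimensional, with a single data type for all elements
mpg_values = pd.Series([21.0, 22.8, 18.7], name="mpg")

# A DataFrame: two-dimensional, each column can have its own data type
cars = pd.DataFrame({
    "model": ["car_a", "car_b", "car_c"],  # text
    "mpg": [21.0, 22.8, 18.7],             # floating point numbers
    "cyl": [6, 4, 8],                      # integers
})

print(mpg_values.dtype)   # data type of the Series
print(cars.dtypes)        # one data type per column
print(cars.shape)         # (number of rows, number of columns)
```

Selecting a single column, as in `cars["mpg"]`, returns a Series, so the two structures are closely related.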

Pandas provides a wide range of functions and methods for data manipulation, cleaning, filtering, aggregation, merging, and more. It integrates well with other libraries in the data science ecosystem, such as Matplotlib for data visualization.

Some key advantages of using Pandas over base Python include:
- Efficient memory usage and performance for large datasets
- Intuitive syntax for data manipulation
- Built-in functions for common data operations (e.g., filtering, grouping, reshaping)
- Seamless integration with other data science libraries
- Handling of missing data and data alignment
- Powerful tools for data exploration and analysis

Throughout this lab, we will explore various aspects of Pandas and its capabilities for data manipulation, visualization and analysis.



# Working with Data Using Pandas

## Load Libraries

In this class we will be using:
- Pandas
- Matplotlib
- statsmodels (here only as a source for the `mtcars` data)

In [None]:
import pandas as pd
import matplotlib as mpl
import statsmodels.api as sm # This library includes the mtcars data


Note: Any text preceded by a hash symbol, `#`, in Python is not evaluated. We say such text is "commented out." Comments are a way to make detailed notes on the code.

## Getting Data into Pandas

In this case we will load data using the `statsmodels` library, which downloads it from the Rdatasets collection. `mtcars` is a common practice dataset.


In [None]:
# Download data from the statsmodels API
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data

# The .data attribute is already a pandas DataFrame;
# wrapping it in pd.DataFrame() just makes that explicit
df = pd.DataFrame(mtcars)


## Preview Data

Here is a data dictionary:

|Attribute | Description |
|---------|-----------|
|mpg  |Miles/(US) gallon|
|cyl  |Number of cylinders|
|disp |Displacement (cu.in.)|
|hp   |Gross horsepower|
|drat |Rear axle ratio|
|wt   |Weight (1000 lbs)|
|qsec |1/4 mile time|
|vs   |Engine (0 = V-shaped, 1 = straight)|
|am   |Transmission (0 = automatic, 1 = manual)|
|gear |Number of forward gears|
|carb |Number of carburetors|


In [None]:
# Look at the top rows with head()
df.head()

To get information on a function (such as arguments and examples) use the `help()` function:



In [None]:
help(df.head)

As we can see, this function has a single argument, `n`, with a default setting of 5.  `n` defines how many rows to return.
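
A quick way to confirm what `n` does is to try it on a small DataFrame. The sketch below uses a made-up frame called `toy` (hypothetical, not the lab data) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical 8-row DataFrame, used only to illustrate head() and tail()
toy = pd.DataFrame({"x": range(8)})

default_rows = toy.head()       # n defaults to 5
three_rows = toy.head(n=3)      # ask for the first 3 rows instead
last_two = toy.tail(2)          # n can also be passed positionally

print(len(default_rows), len(three_rows), len(last_two))
```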

In [None]:
# Look at the last rows with tail().
# The default for n is again 5; here we ask for 10 rows.

df.tail(n = 10)

In [None]:
# get a statistical summary of the numeric variables in a dataset
df.describe()

Some of these summary statistics will be familiar but what about:  

- `std`?
- `25%`?
- `50%`?
- `75%`?
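
One way to see what these rows mean is to compute them directly. The sketch below uses a made-up Series named `values` (not the lab data) and compares the `describe()` output with the `std()`, `quantile()`, and `median()` methods. `std` is the sample standard deviation, and `25%`, `50%`, and `75%` are percentiles.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])
summary = values.describe()

# std is the sample standard deviation (spread around the mean)
print(summary["std"], values.std())

# 25%, 50% and 75% are percentiles: the values below which that
# share of the observations fall. The 50% row is the median.
print(summary["25%"], values.quantile(0.25))
print(summary["50%"], values.median())
print(summary["75%"], values.quantile(0.75))
```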

In [None]:
# find out how variables are encoded in a dataset
df.info()

## Your Turn:  Use AI to learn

1. Click on the "Gemini" button in the upper right corner of Colab.
2. Type "what is float64" in the prompt box.
3. Type "what does floating point mean" in the prompt box.
4. Type "what is int64" in the prompt box.


## Dealing with Errors

In [None]:
df.info(n = 10)

Use AI to Debug Code:

- Click the "Explain error" button to automatically invoke Gemini
- Copy and paste the code and the error message into an external LLM like ChatGPT or Claude
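
It also helps to read the error message itself. As a sketch (using a hypothetical one-column DataFrame named `toy`), the code below catches the error raised by passing `n` to `info()`. Unlike `head()` and `tail()`, `info()` does not accept an `n` argument, so Python raises a `TypeError`.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3]})

try:
    toy.info(n=10)          # info() has no argument called n
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```

The last line of a Python error usually names the error type and the cause, so it is a good place to start reading.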

## Pandas Syntax

Pandas is designed to be used primarily with DataFrames. The syntax `df.method()` (where "method" could be "head" or "tail" or "info") indicates that you are calling a method, that is, a function attached to an object, on the DataFrame object `df`.

So, above, we used `df.head()` etc., meaning that the `head()` method is being called on the DataFrame object, `df`.
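
To make the `df.method()` pattern concrete, here is a small sketch with a hypothetical DataFrame named `toy`. Calls such as `head()` use parentheses and can take arguments, while attributes such as `shape` are read without parentheses.

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30, 40], "y": [1.5, 2.5, 3.5, 4.5]})

first_two = toy.head(2)        # called with parentheses and an argument
summary = toy.describe()       # called with parentheses, no arguments
n_rows, n_cols = toy.shape     # attribute: no parentheses

print(first_two)
print(summary)
print(n_rows, n_cols)
```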

## Your Turn: Summarize Data with Pandas

- Download insurance data.  (I think Matt P showed you this dataset.)

In [None]:
idf = pd.read_csv("https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/insurance.csv")



1. Display the top and bottom rows of `idf`.


2. What is the average expense? (Hint:  use the functions above.  Later on we'll find other ways of doing the calculation.)

3. What is the median age?

In [None]:
# Your code goes here


In [None]:
# Your code goes here


In [None]:
# Your code goes here


# Visualizing Data Using Pandas


Pandas can create a variety of plots with the `df.plot(kind = "...")` syntax, where `plot()` is a method of the DataFrame (or Series) and `kind` selects the type of plot. Pandas uses the Matplotlib library in the background to draw the plots.
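
As a sketch of this syntax, the snippet below draws a made-up Series twice, once with `plot(kind = "line")` and once with the equivalent `plot.line()` shorthand. The data and the name `toy` are hypothetical. The `matplotlib.use("Agg")` line is only there so the example runs outside a notebook; in Colab it can be left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab
import pandas as pd

toy = pd.Series([3, 1, 4, 1, 5], name="value")

ax1 = toy.plot(kind="line")   # generic form: choose the plot type with kind
ax2 = toy.plot.line()         # shorthand for the same kind of plot

# Both calls return a Matplotlib Axes object, which shows that
# Pandas hands the drawing over to Matplotlib.
print(type(ax1).__module__)
```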

Here are examples.

## Line Chart

In [None]:
#Line chart
df['mpg'].plot(kind = "line", color='blue')

- `df['mpg']`: defines a Pandas series
- `.plot()`: calls the generic plot function
- `kind = "line"`: an argument to the `plot()` function that defines the type of plot
- `color = "blue"`:   another argument to the `plot()` function that defines the color of the plot, in this case the line

Incidentally, we can use single ('...') or double ("...") quotation marks interchangeably in Python to define a string.

Most functions have a lot of possible arguments.  For example, we can rotate the labels of the cars for legibility (`rot = `), and define the axis labels (`xlabel`, `ylabel`).

In [None]:
# Rotate the x-axis tick labels and add axis labels
df['mpg'].plot(kind = "line",
               color='blue',
               rot = 90,
               xlabel = "Car",
               ylabel = "Miles per gallon")


How would you add a title? Yes, it would be `plot(title = "...")`. (Plots need titles!)
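
Here is a short sketch of adding a title, using a made-up Series named `toy` in place of `df['mpg']`. The `matplotlib.use("Agg")` line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab
import pandas as pd

# Hypothetical values, indexed by made-up car names
toy = pd.Series([21.0, 22.8, 18.7], index=["car_a", "car_b", "car_c"])

ax = toy.plot(kind="line",
              rot=90,
              xlabel="Car",
              ylabel="Miles per gallon",
              title="Fuel economy by car")

# The returned Axes object lets us check what was set
print(ax.get_title(), "|", ax.get_ylabel())
```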

## Bar Chart

In [None]:
# Bar chart
df['mpg'].plot(kind = "barh",
               color='red',
               xticks=[0,10,20],
               title = "Car MPG")

Pandas automatically uses the row names as the labels on the left.

Why have we used `kind = "barh"`? This stands for "horizontal bar plot", which leaves room for the row names along the vertical axis. The alternative, `kind = "bar"`, draws vertical bars and crowds the row names along the horizontal axis.

How would we get appropriate axis labels?

In [None]:
df['mpg'].plot(kind = "barh",
               color='red',
               xticks=[0,10,20],
               title = "Car MPG",
               xlabel = "Miles per Gallon",
               ylabel = "Cars")

## Histogram

In [None]:
#Histogram
df['mpg'].plot(kind = "hist",
               bins=15,
               title='Miles Per Gallon')

## Your Turn:  Customize the plot

- Add appropriate x- and y- axis labels

In [None]:
# Your code goes here


## Boxplot

Boxplots are best for showing the relationship between a numeric and a categorical variable. Specifically, they reveal the distribution of the numeric variable at different levels of the categorical variable.

Here is a box plot showing the distribution for just one numeric variable.

In [None]:
# Boxplot
df.plot(kind = "box", column = "mpg")

And here is a box plot showing the relationship between a categorical and a numeric variable.

In [None]:
# Boxplot showing distribution of mpg at levels of cyl

df.plot(kind = "box", column = "mpg", by = "cyl")
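
The numbers that a grouped box plot draws can also be computed directly. As a sketch, the snippet below uses `groupby()` and `describe()` on a made-up DataFrame named `toy` (hypothetical values, not the real `mtcars` data). The `25%`, `50%`, and `75%` columns correspond to the bottom edge, middle line, and top edge of each box.

```python
import pandas as pd

# Hypothetical data: two levels of cyl, three cars in each
toy = pd.DataFrame({
    "cyl": [4, 4, 4, 6, 6, 6],
    "mpg": [30.0, 26.0, 22.0, 21.0, 19.0, 17.0],
})

# One row of summary statistics per level of cyl
by_cyl = toy.groupby("cyl")["mpg"].describe()
print(by_cyl[["25%", "50%", "75%"]])
```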

## Scatterplot

In [None]:
# Scatter plot
df.plot(kind = "scatter", x = 'mpg', y = 'hp', c = 'wt')

There are a couple of things to notice here.
1. This plot shows the relationship between *three* variables: `mpg`, `hp`, and `wt`.  
2. `wt` is a continuous variable, so the color scale is appropriately continuous. The colorbar on the right shows how gradations in color map to values of `wt`.

Suppose that the color scale was used to represent *discrete* differences between categories. Here is what that would look like:

In [None]:
# First make cyl explicitly a categorical variable
df["cyl"] = df["cyl"].astype("category")

# Now use cyl to set the color scheme
df.plot(kind = "scatter", x = 'mpg',y = 'hp',c = 'cyl', cmap='viridis')
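
To see what `astype("category")` changes, the sketch below converts a made-up Series named `cyl` (hypothetical values) and inspects the result. A categorical column stores a fixed set of levels plus an integer code for each row, which is the kind of information a discrete color scale needs.

```python
import pandas as pd

cyl = pd.Series([6, 4, 8, 4, 6])
cyl_cat = cyl.astype("category")

print(cyl.dtype)                     # numeric before conversion
print(cyl_cat.dtype)                 # category after conversion
print(list(cyl_cat.cat.categories))  # the distinct levels
print(list(cyl_cat.cat.codes))       # integer code of each row's level
```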

## Your Turn: Visualizing Relationships

Using the `idf` dataset:

1. Make a plot showing the distribution of `bmi` by `sex`.
2. Make a plot showing how the relationship between `bmi` and `expenses` differs by `region`.
3. Does the relationship between `bmi` and `expenses` differ by `age`?

Make interpretive comments on all your plots.

In [None]:
# Your code goes here


In [None]:
# Your code goes here


In [None]:
# Your code goes here

# Summary of Functions

Here's a summary of the key functions covered:

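- `pd.DataFrame()`: Creates a DataFrame from a dictionary or other data source
- `pd.read_csv()`: Reads a CSV file (from a path or URL) into a DataFrame
- `help()`: Shows the documentation for a function or method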
- `df.head()`: Displays the first few rows of a DataFrame
- `df.tail()`: Displays the last few rows of a DataFrame
- `df.info()`: Provides a concise summary of a DataFrame
- `df.describe()`: Generates descriptive statistics of a DataFrame
- `df.plot(kind = "line")`: Creates a line chart
- `df.plot(kind = "barh")`: Creates a horizontal bar chart
- `df.plot(kind = "hist")`: Creates a histogram
- `df.plot(kind = "box")`: Creates a box plot
- `df.plot(kind = "scatter")`: Creates a scatter plot

