##### STA 141B Data & Web Technologies for Data Analysis

### Lecture 4, 1/15/26, Pandas


### Announcement

- __The midterm will be postponed from February 10 to February 12, 7:30 - 8:50 am, and will take place in 198 Young.__
- On Tuesday, February 10, there will be a normal lecture from 10:30 - 11:50 am.
- On Thursday, the 'normal' lecture between 10:30 and 11:50 will be omitted (because of the midterm some hours before).
- Homework will be uploaded soon.

### Last lecture's topics

- Numpy

### Today's topics

 - Pandas
 
### Data Sets

 - `dogs_full.csv`
 - `fluidmilk.xlsx`

### References

 - Python for Data Analysis, Ch. 5, 10
 - [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), Ch. 3

## Pandas

Pandas is a Python package that provides tools for manipulating tabular data. The name "pandas" is short for "**PAN**el **DA**ta", an econometrics term. Since we're using Anaconda, Pandas is already installed.

Pandas is documented [here](http://pandas.pydata.org/pandas-docs/stable/).

In [None]:
!pip install pandas

In [None]:
import pandas as pd

### Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). It is a generalization of a NumPy array. In addition to its elements, every series includes an index.

In contrast to the numpy array, a pandas series MAY consist of <b>different data types.</b>

See [here](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes) for an overview of data types.



In [None]:
import numpy as np

In [None]:
x = np.array([1,2,"a"])
print(x)

In [None]:
x = pd.Series([1, 2, 3])
type(x)

In [None]:
x

In [None]:
x.dtype

In [None]:
x.iloc[2] = "a"

In [None]:
print(x)

In [None]:
y = x.rename(index={1: 'one'}, inplace=False) # returns a renamed copy; x itself is unchanged

In [None]:
x

In [None]:
y

In [None]:
x

In [None]:
y = pd.Series(["a", "b", "c"], index = [3,2,1])
print(y)

In [None]:
y[1]

In [None]:
y.loc[1]

In [None]:
x = pd.Series([1,2,3], index = ["a", "b", "c"])
x

A series can be indexed in all of the same ways as a NumPy array, but also by index values. This means a series can also be used like an ordered dictionary (although, unlike dictionary keys, its index values need not be unique).
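
The dictionary-like behaviour can be sketched as follows. This is a minimal illustration on a small series like the one above; the variable names `has_b`, `keys` and `as_dict` are only chosen for this example.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])

has_b = "b" in x        # membership tests the index labels, like dict keys
keys = list(x.keys())   # .keys() returns the index
as_dict = x.to_dict()   # convert to a plain Python dictionary

print(has_b, keys, as_dict)
```

Note that `in` checks the labels, not the values; to test for a value, use `x.isin([...])` or `value in x.values`.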

In [None]:
x[0]

In [None]:
x[1:3]

In [None]:
x["a"]

In [None]:
x["a":"b"]

In [None]:
x["b":"a"]

In [None]:
x = pd.Series([1,2,3], index = ["c", "b", "a"])
x

In [None]:
x["a":"b"]

In [None]:
x["c":"b"]

In [None]:
x = pd.Series([1,2,3], index = ["c", "a", "b"])
x["c":"b"]

#### Uniqueness of indices
Note that index values need not be unique. However, if the same label occurs more than once, that can cause problems when selecting observations.

In [None]:
x = pd.Series([1,2,3], index = ["c", "a", "a"])
x["c":"a"]

In particular, you can get error messages if it is unclear what kind of slicing you were referring to:

In [None]:
x = pd.Series([1,2,3,4], index = ["c", "a", "b", "a"])
print(x)

In [None]:
x["c":"a"]
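
As a hedged sketch of how one might guard against this, the snippet below checks `index.is_unique` first and catches the `KeyError` that the label slice is expected to raise here. The flag `slice_failed` is a name made up for this illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4], index=["c", "a", "b", "a"])

print(x.index.is_unique)   # False: the label "a" occurs twice
both_a = x["a"]            # a duplicated label returns a Series, not a scalar
print(both_a)

try:
    x["c":"a"]             # ambiguous: which "a" should end the slice?
    slice_failed = False
except KeyError as err:
    slice_failed = True
    print("KeyError:", err)
```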

A series may have integer indices as well. 

In [None]:
x = pd.Series([1, 2, 3], index = [10, 12, 14])
x

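For indexing a series (and, as we'll see later, also data frames):

* `[ ]` is by position, name, or condition. **Exception:** for an integer index it's by name or condition only. (Looking up a single integer position with `[ ]` is deprecated since pandas 2.1; prefer `.iloc[ ]`.)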
* `.iloc[ ]` is by position
* `.loc[ ]` is by name or condition
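
The three access styles can be compared side by side in a small sketch. The series `s` and the variable names below are hypothetical and only serve this illustration.

```python
import pandas as pd

# Hypothetical series with an integer index that does not start at 0
s = pd.Series(["a", "b", "c"], index=[10, 12, 14])

first_by_position = s.iloc[0]   # position 0
first_by_label = s.loc[10]      # label 10
also_by_label = s[10]           # [ ] uses the label for an integer index
above_10 = s[s.index > 10]      # a condition keeps the labels 12 and 14

print(first_by_position, first_by_label, also_by_label)
print(above_10)
```

Using `.iloc[ ]` and `.loc[ ]` explicitly avoids having to remember the integer-index exception.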

In [None]:
x.iloc[0] # first element of x

In [None]:
x.loc[10]

In [None]:
y = pd.Series([1,2,3])
print(y)
z = y[1:3]
z

In [None]:
z[1] # and not 3

In [None]:
z.iloc[1]

In [None]:
z.loc[1]

In [None]:
newy = pd.Series(["a","b","c"], index = [1,2,3])
print(newy)

In [None]:
newy[1] == "b" # False: [1] refers to the element with label 1 (value "a"), not to position 1

In [None]:
newy = pd.Series([1,2,3,4], index = [1, "1", 2, 3])

In [None]:
print(newy) # it looks as if the indices were not unique!

In [None]:
newy[1]

In [None]:
newy["1"]

In [None]:
print(newy.index) # dtype is object because the index contains both strings and integers

In [None]:
# in contrast
newy = pd.Series([1,2,3], index = [1,1,2]) 

In [None]:
print(newy)

In [None]:
newy[1]

In [None]:
newy["1"] # gives a key error!

In [None]:
# better:
newy.get("1")

In [None]:
# or provide a default value
print(newy.get("1", "not found"))
newy.get("1") is None

In [None]:
z = pd.Series([1, 2, 3, 4], index = [3j, 2j, 1j, -1j])
z

In [None]:
z[3j]

In [None]:
z.loc[-1j]

In [None]:
z.index

In case you want to get the corresponding iloc location for an index name, you can use the `get_indexer` method.

In [None]:
z.index.get_indexer([1j, 2j])

In [None]:
# don't mix this up: 
# if you want to get all keys for a specific value, use
z.index[z == 2]

In [None]:
print(z)

In [None]:
print(z.isin([2,3]))

In [None]:
z.index[z.isin([2,3])]

### Data Frames

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

It represents tabular data as a collection of Series.

In [None]:
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df

In [None]:
df.T #transpose the dataframe

In [None]:
df

Data frames support indexing methods similar to those of series. However, for indexing with `[ ]`,

* Scalar values get columns by name
* Conditions or slices get rows

Reminder: typically, each row corresponds to an observation, while each column corresponds to one feature. For example, every row could be a student and the first column could contain the GPA.

A condition would be to filter for all students having a GPA above 3.5.

In contrast, if we want to get a summary of the GPA column, we would just select a <b>single</b> column from the data frame by using a scalar value.

In [None]:
df["y"]

In [None]:
df['x']

In [None]:
df[0:1]

In [None]:
print(df)

In [None]:
df.iloc[0,1]

In [None]:
df.iloc[0]

In [None]:
df.iloc[0:2]

In [None]:
df 

In [None]:
df["x"]

In [None]:
df.loc[0:1,"x"] 

In [None]:
df[df["x"] > 2] # more convenient 

In [None]:
df2 = df.loc[df['x']>2,:]  # more principled (?)

In [None]:
print(df2)

Since we are subsetting a DataFrame, a DataFrame is returned. 

In [None]:
df.shape

In [None]:
df.size

In [None]:
type(df)

In [None]:
df.dtypes

More on indexing and selecting data with pandas can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-slicing-with-labels).

Columns whose values do not share a single data type get the dtype `object`. <b>Beware:</b> string columns also have dtype `object`.

In [None]:
dfnew = pd.DataFrame({"x": [1, "2"], "y": [3,4]})

In [None]:
print(dfnew)

In [None]:
dfnew.dtypes

In [None]:
dfnew = pd.DataFrame({"x": ["1", "2"], "y": [3,4]})

In [None]:
dfnew.dtypes

### Missing Data

Pandas represents missing data with `NaN` and `None`, but these values do not exclusively mean missing data. For instance, `NaN` stands for "Not a Number" and is also the result of undefined computations. Pay attention to your data and code to determine whether values are missing or have some other meaning.
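
The following sketch illustrates why `NaN` alone does not tell you whether a value was missing. The series `s` is made up: its second entry is missing from the start, while the third entry only becomes `NaN` through the undefined computation 0/0.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 0.0])

# Element-wise division: 1/1 = 1, NaN/NaN = NaN, 0/0 = NaN
ratio = s / s

print(ratio.isna())        # flags both NaN entries, whatever their origin
print(np.nan == np.nan)    # False: NaN never compares equal, so use .isna()
n_missing = int(ratio.isna().sum())
```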

You can create `NaN` values with NumPy.

In [None]:
import numpy as np 
np.nan

In [None]:
df = pd.DataFrame({"x": [1, np.nan, 2], "y": np.arange(3), "z": ["x", "y", None]})
df

Use the `.isna()` or `.isnull()` methods to detect missing values.

In [None]:
df.isna()

In [None]:
~df.isnull()

In [None]:
df

In [None]:
df.info()

In [None]:
df

In [None]:
(~df.isna()).sum().sum() # how many values are not missing

In [None]:
df.isna().sum().sum() # how many values are missing

In [None]:
df.sum().sum()

Let's deal with this warning.

In [None]:
df.sum(numeric_only=True).sum()

### Data Alignment

Pandas supports vectorized operations, but elements are <b>automatically</b> aligned by index. **Beware!!** This is a major difference compared to R.

In [None]:
x = pd.Series([1, 2, 3], index = ["a", "b", "c"])
y = pd.Series([1, 2, 3], index = ["b", "a", "c"])
x

In [None]:
y

In [None]:
x * y
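
A short sketch of what alignment implies when the labels do not match completely, and of one possible way to multiply by position instead. The series below are hypothetical; converting with `.to_numpy()` is just one option for ignoring the labels.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])
y = pd.Series([10, 20, 30], index=["b", "a", "d"])

aligned = x * y                             # matched by label; "c" and "d" have no partner
by_position = x.to_numpy() * y.to_numpy()   # plain NumPy arrays ignore the labels

print(aligned)
print(by_position)
```

Labels that appear in only one of the two series yield `NaN` in the aligned result.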

You can use the `.reset_index()` method to reset the indexes on a series or data frame. Watch out: The method returns a new DataFrame, but does not overwrite the old object. 

In [None]:
x.reset_index() # keep old index as a new column

In [None]:
df2 = df.reset_index()

In [None]:
df # does not overwrite the old object!

In [None]:
df2

In [None]:
y.reset_index(drop = True) # throw away the old index

### Reading Data

Pandas provides functions for reading (and writing) a variety of common formats. Most of their names begin with `read_`. For instance, we can read the dogs data from a CSV file:

In [None]:
dogs = pd.read_csv("./../data/dogs_full.csv") # change this file path accordingly

In [None]:
dogs.head()

### Inspecting Data

Series and data frames provide many of the same methods and attributes as NumPy arrays.

For a data frame, the `.dtypes` attribute gives the column types.

The type "object" means some non-numeric Python object, often a string.

In [None]:
dogs.dtypes

There are also several methods for quickly summarizing data.

In [None]:
dogs.describe()

First, select the string columns (dtype `object`), then describe them:

In [None]:
dogs.select_dtypes(include = ["object"]).describe()

In [None]:
dogs.select_dtypes(include = ["int64"]).describe()

### Aggregation

Pandas also provides several methods for aggregating data, such as `.mean()`, `.median()`, `.std()`, and `.value_counts()`. They ignore missing values by default.
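
What "ignore missing values by default" means can be seen in a small sketch with made-up numbers. The `skipna` argument switches this behaviour off.

```python
import numpy as np
import pandas as pd

w = pd.Series([2.0, np.nan, 4.0])

mean_default = w.mean()              # NaN is skipped: (2 + 4) / 2
mean_strict = w.mean(skipna=False)   # NaN propagates into the result

# value_counts() also leaves out missing entries unless dropna=False is passed
counts = pd.Series(["x", "y", "x", None]).value_counts()

print(mean_default, mean_strict)
print(counts)
```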

In [None]:
dogs.median(numeric_only=True)

In [None]:
dogs["weight"].median() # N

In [None]:
dogs["breed"].value_counts() # like R's table() with 1 arg

Wiki link to [Xoloitzcuintli](https://en.wikipedia.org/wiki/Xoloitzcuintle).

For counting one group against another (crosstabulating), use `pd.crosstab()`.

In [None]:
pd.crosstab(dogs["group"], dogs["kids"]) # like R's table() with 2+ arg

### Applying Functions

You can also use Pandas to apply your own aggregation functions to columns or rows.

* `.apply()` applies a function column-by-column or row-by-row.
* `.applymap()` applies a function element-by-element.

This is another way of vectorizing code. Note that `.applymap()` exists only for data frames, and since pandas 2.1 it is deprecated in favor of `DataFrame.map()`.
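
The difference between the two methods, sketched on a small made-up data frame. The `hasattr` check is an assumption of this sketch so that it runs on both older and newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 40]})

# .apply() hands a whole column (or, with axis=1, a whole row) to the function
col_spread = df.apply(lambda col: col.max() - col.min())
row_spread = df.apply(lambda row: row.max() - row.min(), axis=1)

# Element-wise: DataFrame.map in pandas >= 2.1, .applymap in older versions
elementwise = df.map if hasattr(df, "map") else df.applymap
squared = elementwise(lambda v: v ** 2)

print(col_spread)
print(row_spread)
print(squared)
```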


In [None]:
def spread(x):
    '''Returns spread. Input is a single column (or row)'''
    return x.max() - x.min()
    
dogs.select_dtypes(include = ["float64", "int64"]).apply(spread)

In contrast, we can also apply the function rowwise.

In [None]:
dogs.select_dtypes(include = ["float64", "int64"]).apply(spread, axis=1)

A more practical example:

In [None]:
def annual_costs(d):
    return d["lifetime_cost"] / d["longevity"]

ac = dogs.apply(annual_costs, axis=1)

In [None]:
print(ac)

In [None]:
ac.isnull().sum()

In [None]:
dogs["longevity"].isnull().sum()

In [None]:
sum(dogs["longevity"] == 0)

In [None]:
dogs["lifetime_cost"].isnull().sum()

Check whether `lifetime_cost` equals `price + food_cost * longevity`:

In [None]:
temp = (dogs["lifetime_cost"]-dogs["price"])/(dogs["food_cost"]*dogs["longevity"])

In [None]:
temp.describe()

So the lifetime cost is not just `price + food_cost * longevity`.

### Grouping

Use the `.groupby()` method to group data before computing aggregate statistics.

In [None]:
dogs.head()

In [None]:
dogs.groupby("group")

In [None]:
dogs.groupby("group").mean(numeric_only=True)

In [None]:
dogs.groupby("group").mean(numeric_only=True).reset_index()

By default, the groups become the index. You can keep them as regular columns by setting `as_index = False` when grouping.

In [None]:
dogs.groupby("group", as_index = False).mean(numeric_only=True)

You can group by multiple columns.

In [None]:
dogs.groupby(["group", "kids"]).mean(numeric_only=True)

On groups, the `.apply()` method computes group-by-group. It is the most general form of two other methods:

* `.agg()`, which applies a function to each group to compute summary statistics
* `.transform()`, which applies a function to each group to compute transformations (such as standardization)
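
A minimal sketch of the difference between the two, using a tiny made-up data frame rather than the dogs data. Centering each group stands in here for a full standardization.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["toy", "toy", "herding", "herding"],
    "weight": [5.0, 7.0, 20.0, 30.0],
})
g = df.groupby("group")["weight"]

summary = g.agg(["mean", "max"])                 # one row per group
centered = g.transform(lambda w: w - w.mean())   # same length as the input

print(summary)
print(centered)
```

`.agg()` shrinks the data to one row per group, whereas `.transform()` returns a result aligned with the original rows, so it can be assigned back as a new column.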

## Tidying a Dataset

Do Americans prefer low fat milk over whole milk?

The USDA publishes data about dairy production. We can answer the question with the [Milk Sales Dataset](https://www.ers.usda.gov/webdocs/DataFiles/48685/fluidmilk.xlsx?v=5010.6).

Many of Python's visualization packages expect [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf), which means:

1. Each feature must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
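
To make the three rules concrete, here is a sketch that reshapes a hypothetical wide table (one column per kind of milk, made-up numbers) into tidy form with `melt()`. This is only one possible route and not necessarily the one taken in the rest of this section.

```python
import pandas as pd

# Hypothetical wide table: the feature "Kind" is hidden in the column names
wide = pd.DataFrame({
    "Year": [2000, 2001],
    "Whole": [10, 9],
    "Low": [5, 6],
})

# One row per (Year, Kind) observation, one column per feature
tidy = wide.melt(id_vars="Year", var_name="Kind", value_name="Sales")
print(tidy)
```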

Let's tidy up the Milk Sales Dataset so we can make a line plot that shows how milk sales have changed over time.

In [None]:
import numpy as np
import pandas as pd

In [None]:
milk = pd.read_excel("../data/fluidmilk.xlsx")
milk.head()

In [None]:
milk = pd.read_excel("../data/fluidmilk.xlsx", skiprows = 1)
milk.head()

In [None]:
milk.columns

In [None]:
milk.index

In [None]:
milk.columns = milk.columns.str.replace('\n', '')
milk.head()

In [None]:
milk.columns

In [None]:
milk = milk.rename(columns=lambda x: x.strip(' 012')) # strip spaces and the characters 0, 1 and 2 from both ends of each name
milk.head()

In [None]:
milk = milk.rename(columns = {'Unnamed:': 'Year'})
milk.head()

In [None]:
milk.columns.values[[2,3,5,6]] = np.array(['Reduced', 'Low', 
                                            'Flavored Whole', 'Flavored Other'])
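
Writing into `columns.values` changes the index in place, although pandas treats an index as immutable. A hedged alternative, sketched on a made-up one-row data frame, is to build a new list of names and assign it as a whole:

```python
import pandas as pd

# Hypothetical column names, loosely modeled on the ones above
df = pd.DataFrame([[1975, 1, 2]], columns=["Unnamed: 0", "Whole", "2% milk"])

new_names = df.columns.to_list()   # copy the names into a plain list
new_names[0] = "Year"              # replace selected names by position
new_names[2] = "Reduced"
df.columns = new_names             # assign a fresh index in one step

print(df.columns)
```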

In [None]:
milk.head()

In [None]:
milk.dtypes

In [None]:
milk = milk.set_index('Year')  

In [None]:
milk.head()

In [None]:
milk.tail()

In [None]:
milk = milk[:-4] # get rid of the last four rows

Summarize everything:

In [None]:
def get_milk():
    milk = pd.read_excel("../data/fluidmilk.xlsx", skiprows = 1)
    milk.columns = milk.columns.str.replace('\n', '')
    milk = milk.rename(columns=lambda name: name.strip(' 12')) # strip spaces and the digits 1 and 2
    milk.columns.values[[0,2,3,5,6]] = np.array(['Year', 'Reduced', 'Low', 
                                                 'Flavored Whole', 'Flavored Other'])
    milk = milk[:-4] # get rid of the last four rows
    milk = milk.set_index("Year")
    return milk

In [None]:
milk = get_milk()
milk.head()

In [None]:
milk = milk.stack()
milk

In [None]:
milk.index

In [None]:
milk = milk.reset_index()
milk

In [None]:
milk.columns.values[[False, True, True]] = np.array(["Kind", "Sales"])

In [None]:
milk.head()

In [None]:
milk.tail()

In [None]:
!pip install plotnine

In [None]:
import plotnine as p9

p9.__version__

In [None]:
(
    p9.ggplot(milk, p9.aes(x = "Year", y = "Sales", color = "Kind")) 
    + p9.geom_line()
    + p9.labs(title = "US Milk Sales", y = "Sales (millions of pounds)")
)

In [None]:
milk = get_milk()

In [None]:
milk["Whole"]/milk["Total"]

In [None]:
def calc_share(x):
    return 100*x/x["Total"]

milk_share = milk.apply(calc_share, axis=1)

In [None]:
milk_share.head()

In [None]:
milk_share = milk_share.drop("Total", axis=1)
milk_share.head()

In [None]:
milk_share = milk_share.stack()
milk_share = milk_share.reset_index()

In [None]:
milk_share

In [None]:
milk_share.columns.values[[False, True, True]] = np.array(["Kind", "Sales"])

In [None]:
(
    p9.ggplot(milk_share, p9.aes(x = "Year", y = "Sales", color = "Kind")) 
    + p9.geom_line()
    + p9.labs(title = "US milk market share", y = "Share in %")
)