# M1L5 More EDA with Pandas 

This notebook will guide you through some essential data manipulation techniques using the Pandas library in Python. We'll be working with the Austin Animal Center Intakes dataset, which contains information about animals entering the Austin Animal Center.

### **Dataset:** [Austin Animal Center Intakes](https://catalog.data.gov/dataset/austin-animal-center-intakes) (also available in your data folder)

### **Objectives:**

 1.  Load and explore the dataset.
 2.  Use `groupby()` to aggregate data.
 3.  Create contingency tables using `crosstab()`.
 4.  Identify and handle duplicate entries.

## Step 1:  Import pandas and numpy 

In [1]:
# Import packages

import pandas as pd
import numpy as np

## Step 2:  Load in the data and save it as `df`

In [2]:
df = pd.read_csv('/Users/Marcy_Student/Desktop/marcy/marcy-git/DA2025_Lectures/Mod1/CSVs/Austin_Animal_Center_Intakes__10_01_2013_to_05_05_2025_.csv')

## Step 3:  Look at the data (can you think of some methods to do this?)

In [3]:
df

| | Animal ID | Name | DateTime | MonthYear | Found Location | Intake Type | Intake Condition | Animal Type | Sex upon Intake | Age upon Intake | Breed | Color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A521520 | Nina | 10/01/2013 07:51:00 AM | October 2013 | Norht Ec in Austin (TX) | Stray | Normal | Dog | Spayed Female | 7 years | Border Terrier/Border Collie | White/Tan |
| 1 | A664235 | NaN | 10/01/2013 08:33:00 AM | October 2013 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White |
| 2 | A664236 | NaN | 10/01/2013 08:33:00 AM | October 2013 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White |
| 3 | A664237 | NaN | 10/01/2013 08:33:00 AM | October 2013 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White |
| 4 | A664233 | Stevie | 10/01/2013 08:53:00 AM | October 2013 | 7405 Springtime in Austin (TX) | Stray | Injured | Dog | Intact Female | 3 years | Pit Bull Mix | Blue/White |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173807 | A929690 | NaN | 05/03/2025 11:18:00 PM | May 2025 | 8038 Exchange Dr in Austin (TX) | Stray | Injured | Dog | Intact Male | 2 years | Belgian Malinois | Brown/Black |
| 173808 | A929717 | NaN | 05/04/2025 03:14:00 PM | May 2025 | Austin (TX) | Public Assist | Normal | Dog | Intact Male | 1 year | Shih Tzu Mix | White/Blue |
| 173809 | A929724 | NaN | 05/04/2025 07:43:00 PM | May 2025 | 7105 Providence Ave Apt 3 in Austin (TX) | Stray | Normal | Other | Unknown | 1 year | Rabbit Sh | Tan/White |
| 173810 | A929725 | Oswold | 05/04/2025 10:55:00 PM | May 2025 | 1501 Red River St in Austin (TX) | Public Assist | Normal | Dog | Intact Male | 10 years | Boxer Mix | Tan/White |
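
Typing `df` is only one way to look at the data. As a minimal sketch of a few other common inspection methods, the cell below uses a small hypothetical `sample_df` (three made-up rows with column names borrowed from the intake data) so it runs without the CSV. The same calls should work on `df`.

```python
import pandas as pd

# Hypothetical three-row sample; only the column names come from the real dataset
sample_df = pd.DataFrame({
    'Animal ID': ['A521520', 'A664235', 'A664233'],
    'Name': ['Nina', None, 'Stevie'],
    'Animal Type': ['Dog', 'Cat', 'Dog'],
})

print(sample_df.head(2))                  # first two rows
print(sample_df.shape)                    # (number of rows, number of columns)
sample_df.info()                          # dtypes and non-null counts per column
print(sample_df.describe(include='all'))  # summary statistics for every column
```

`head()` and `shape` give a quick sense of size and layout, while `info()` is often the fastest way to spot columns with missing values.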


## Step 4:  Count the missing values in each column (chain two methods here: one to check for missing values and one to sum them)

In [31]:
df.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Found Location',
       'Intake Type', 'Intake Condition', 'Animal Type', 'Sex upon Intake',
       'Age upon Intake', 'Breed', 'Color'],
      dtype='object')

In [57]:
for col in df.columns:
    print(df[col].isnull().value_counts())
    print('')

Animal ID
False    173812
Name: count, dtype: int64

Name
False    123821
True      49991
Name: count, dtype: int64

DateTime
False    173812
Name: count, dtype: int64

MonthYear
False    173812
Name: count, dtype: int64

Found Location
False    173812
Name: count, dtype: int64

Intake Type
False    173812
Name: count, dtype: int64

Intake Condition
False    173812
Name: count, dtype: int64

Animal Type
False    173812
Name: count, dtype: int64

Sex upon Intake
False    173811
True          1
Name: count, dtype: int64

Age upon Intake
False    173812
Name: count, dtype: int64

Breed
False    173812
Name: count, dtype: int64

Color
False    173812
Name: count, dtype: int64
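
The loop above works, but the two-method chain that the step asks for is shorter. Here is a minimal sketch on a hypothetical `sample_df` (the IDs and values are made up for illustration). On the real data, the equivalent call would be `df.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing values
sample_df = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3', 'A4'],
    'Name': ['Nina', np.nan, np.nan, 'Stevie'],
    'Sex upon Intake': ['Spayed Female', 'Unknown', np.nan, 'Intact Female'],
})

# isnull() marks every cell True/False; sum() then counts the True values per column
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

Because `True` counts as 1 when summed, the result is one number per column instead of a separate True/False table for each.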





## Step 5:  Count the number of animals by Animal Type

In [35]:
animal_counts = df['Animal Type'].value_counts()
print(animal_counts)

Animal Type
Dog          94608
Cat          69324
Other         8968
Bird           878
Livestock       34
Name: count, dtype: int64
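
The same count can also be produced with `groupby()`, which is one of this notebook's objectives. The sketch below uses a hypothetical `sample_df` with made-up rows. It is meant as an illustration of the pattern, not as the only way to aggregate.

```python
import pandas as pd

# Hypothetical sample; the values are illustrative only
sample_df = pd.DataFrame({
    'Animal Type': ['Dog', 'Cat', 'Dog', 'Bird', 'Dog', 'Cat'],
    'Intake Type': ['Stray', 'Stray', 'Public Assist', 'Stray', 'Stray', 'Owner Surrender'],
})

# Group rows by animal type and count the rows in each group
counts_by_type = sample_df.groupby('Animal Type').size().sort_values(ascending=False)
print(counts_by_type)

# groupby() accepts other aggregations too, e.g. distinct intake types per animal type
distinct_intakes = sample_df.groupby('Animal Type')['Intake Type'].nunique()
print(distinct_intakes)
```

`value_counts()` is the shortcut for simple counts. `groupby()` is more useful once you want a different aggregation or more than one grouping column.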




## Step 6:  Create a crosstab showing the count of animal types for each intake condition.

In [36]:
cross_table = pd.crosstab(df['Animal Type'], df['Intake Condition'])
print(cross_table)

Intake Condition  Aged  Agonal  Behavior  Congenital  Feral  Injured  \
Animal Type                                                            
Bird                 0       0         0           0      0      249   
Cat                 77       3         8           0    133     4339   
Dog                445       1        73           1     11     5000   
Livestock            0       0         0           0      0        2   
Other                3       0         0           0      1     1215   

Intake Condition  Med Attn  Med Urgent  Medical  Neonatal  Neurologic  Normal  \
Animal Type                                                                     
Bird                     0           0        0         1           0     603   
Cat                     27          11      213      1467           5   57032   
Dog                     60          10      391       476           6   83773   
Livestock                0           0        0         1           0      28   
Other    
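
Raw counts can be hard to compare when one animal type is far more common than another. As a sketch, `pd.crosstab()` also takes `margins` and `normalize` arguments. The example uses a hypothetical `sample_df` with made-up rows rather than the real intake data.

```python
import pandas as pd

# Hypothetical sample; the values are illustrative only
sample_df = pd.DataFrame({
    'Animal Type': ['Dog', 'Dog', 'Cat', 'Cat', 'Cat', 'Bird'],
    'Intake Condition': ['Normal', 'Injured', 'Normal', 'Normal', 'Injured', 'Normal'],
})

# margins=True appends an 'All' row and an 'All' column holding the totals
counts = pd.crosstab(sample_df['Animal Type'], sample_df['Intake Condition'], margins=True)
print(counts)

# normalize='index' turns each row into proportions that sum to 1
shares = pd.crosstab(sample_df['Animal Type'], sample_df['Intake Condition'], normalize='index')
print(shares.round(2))
```

Row-normalized shares make it possible to ask, for example, what fraction of each animal type arrives injured.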

## Step 7:  Check for duplicate Animal IDs (pay close attention to the syntax here)

In [43]:
duplicate_ids = df['Animal ID'].duplicated().value_counts()
print(duplicate_ids)

Animal ID
False    156287
True      17525
Name: count, dtype: int64
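
Counting duplicates is only the first half of "identify and handle". Below is a minimal sketch of how the repeated rows could be inspected and, if appropriate, dropped. The `sample_df` is hypothetical. Whether a repeated ID is an error or a legitimate repeat intake of the same animal is something to check in the data before removing anything.

```python
import pandas as pd

# Hypothetical sample in which some IDs appear more than once
sample_df = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A1', 'A3', 'A2', 'A1'],
    'DateTime': ['10/01/2013', '10/02/2013', '11/15/2014',
                 '12/01/2014', '01/05/2015', '03/09/2016'],
})

# duplicated() flags every repeat after the first occurrence; sum() counts them
n_repeats = sample_df['Animal ID'].duplicated().sum()
print(n_repeats)

# keep=False flags ALL rows whose ID appears more than once, which helps inspection
repeat_rows = sample_df[sample_df['Animal ID'].duplicated(keep=False)]
print(repeat_rows.sort_values('Animal ID'))

# One possible way to handle them: keep only the first row for each ID
first_intakes = sample_df.drop_duplicates(subset='Animal ID', keep='first')
print(first_intakes)
```

Note the syntax difference: `duplicated()` returns a True/False Series, while `drop_duplicates()` returns a new DataFrame and leaves the original unchanged.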




## Practice Joining Data 

### Scenario 1
You have customer data split into two different files (DataFrames),
and you want to combine them into a single DataFrame for analysis.

In [44]:
# Run the cell without changes to create the two dataframes 

customers_part1_df = pd.DataFrame({'CustomerID': [101, 102, 103],
                                   'FirstName': ['Alice', 'Bob', 'Charlie'],
                                   'City': ['Anytown', 'Otherville', 'Smallburg']})

customers_part2_df = pd.DataFrame({'CustomerID': [104, 105, 106],
                                   'FirstName': ['David', 'Emily', 'Frank'],
                                   'City': ['Bigcity', 'Townsville', 'Villageton']})

### Scenario 1 Task:  Use `pd.concat()` to stack the two DataFrames above (after all, they have the same columns)

In [46]:
all_customers_df = pd.concat([customers_part1_df, customers_part2_df], ignore_index=True)
all_customers_df

| | CustomerID | FirstName | City |
|---|---|---|---|
| 0 | 101 | Alice | Anytown |
| 1 | 102 | Bob | Otherville |
| 2 | 103 | Charlie | Smallburg |
| 3 | 104 | David | Bigcity |
| 4 | 105 | Emily | Townsville |
| 5 | 106 | Frank | Villageton |
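
To see why `ignore_index=True` matters, here is a small sketch comparing the result with and without it. The frames `part1` and `part2` are hypothetical, shortened versions of the customer DataFrames above.

```python
import pandas as pd

# Hypothetical two-row versions of the customer DataFrames
part1 = pd.DataFrame({'CustomerID': [101, 102], 'FirstName': ['Alice', 'Bob']})
part2 = pd.DataFrame({'CustomerID': [104, 105], 'FirstName': ['David', 'Emily']})

# Without ignore_index, each frame keeps its own row labels: 0, 1, 0, 1
kept_index = pd.concat([part1, part2])
print(kept_index)

# With ignore_index=True, the result gets a fresh 0..n-1 index
fresh_index = pd.concat([part1, part2], ignore_index=True)
print(fresh_index)
```

Repeated index labels are legal in pandas, but they can make label-based lookups such as `.loc[0]` return more than one row, so resetting the index is usually the safer choice when stacking.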


### Scenario 2

Combining customer details and loyalty points.

In [47]:
# Run the cell without changes to create the two dataframes 

customer_details_df = pd.DataFrame({'CustomerID': [101, 102, 103],
                                    'Name': ['Alice', 'Bob', 'Charlie'],
                                    'City': ['Anytown', 'Otherville', 'Smallburg']})

loyalty_points_df = pd.DataFrame({'CustomerID': [101, 102, 103],
                                  'Points': [100, 250, 50]})

### Scenario 2 Task:  Merge the DataFrames on `CustomerID`


In [52]:
merged_customer_df = pd.merge(customer_details_df, loyalty_points_df, on='CustomerID', how='right')
merged_customer_df

| | CustomerID | Name | City | Points |
|---|---|---|---|---|
| 0 | 101 | Alice | Anytown | 100 |
| 1 | 102 | Bob | Otherville | 250 |
| 2 | 103 | Charlie | Smallburg | 50 |
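
In this scenario both DataFrames contain exactly the same customer IDs, so every `how` option gives the same rows. The sketch below uses two hypothetical frames, `details` and `points`, whose IDs only partly overlap, to show where the join types differ.

```python
import pandas as pd

# Hypothetical frames: customer 101 has no points, customer 104 has no details
details = pd.DataFrame({'CustomerID': [101, 102, 103], 'Name': ['Alice', 'Bob', 'Charlie']})
points = pd.DataFrame({'CustomerID': [102, 103, 104], 'Points': [250, 50, 75]})

# inner: only IDs present in both frames
inner = pd.merge(details, points, on='CustomerID', how='inner')
print(inner)

# left: every row from details; Points is NaN where there is no match
left = pd.merge(details, points, on='CustomerID', how='left')
print(left)

# outer: every ID from either frame; indicator=True adds a _merge column
# recording whether each row came from the left frame, the right frame, or both
outer = pd.merge(details, points, on='CustomerID', how='outer', indicator=True)
print(outer)
```

Passing `on='CustomerID'` explicitly documents the join key instead of relying on pandas to pick the columns the two frames have in common.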
