# Data Cleaning Studio

You are part of a team working for an agricultural nonprofit based in California. Your nonprofit helps farmers in the state grow profitable crops in environmentally friendly ways, in an effort to reduce the impact of climate change on the state and provide enough food for the state's 39.5 million residents. With Halloween on the horizon, you and your team are looking to analyze past pumpkin crops to try to answer several questions:
1. Are pumpkins sold at terminal markets in California (San Francisco and Los Angeles) grown in California?
2. Is the harvest season for pumpkins grown in California consistent year-to-year?
3. Are pumpkin farmers growing specific varieties of pumpkins for specific reasons?

The answers to these questions will help your nonprofit decide whether it should promote specific varieties or growing practices to the farmers it serves in time for seeds to be planted next year. Your team has already performed some exploratory analysis on the San Francisco terminal market report of pumpkin sales from 9/2016-9/2017. Now it is time to clean the data!

---

## What You'll Learn

In this studio, you will practice:
- Identifying and handling missing data
- Removing unnecessary or irrelevant columns
- Dealing with inconsistent data formatting
- Making informed decisions about data cleaning strategies
- Documenting your reasoning for data cleaning choices

## How to Use This Notebook

This notebook is structured as a guided lab with questions and exercises. For each question:
1. **Read the instructions carefully** - They provide context and guidance
2. **Examine the code provided** - Some cells have working code for you to run and learn from
3. **Answer the reflection questions** - Write your thoughts in the designated answer cells
4. **Complete the tasks** - Some questions ask you to write your own code
5. **Run all cells in order** - The notebook builds on previous steps

---

## Column Reference Guide

Before diving into cleaning the data, here is a quick guide to the different columns in the USDA report and what they mean:
- **Commodity Name:** This CSV structure is used for lots of USDA reports. In this case, the commodity is pumpkins
- **City Name:** City where the pumpkin was sold. The city is a terminal market location within the United States.
- **Type:** This refers to the type of farming used in growing the pumpkins
- **Package:** The way the pumpkins were packed for sale
- **Variety:** Specific type of pumpkin, e.g. a pie pumpkin or a Howden pumpkin
- **Sub Variety:** Additional classifications about the pumpkins, e.g. is it a flat pumpkin?
- **Grade:** In the US, usually only canned pumpkin is graded
- **Date:** Date of sale (rounded up to the nearest Saturday)
- **Low Price:** This price is in reference to sale price
- **High Price:** This price is in reference to sale price
- **Mostly Low:** This column is not measured for pumpkins
- **Mostly High:** This column is not measured for pumpkins
- **Origin:** Which state the pumpkins were grown in
- **Origin District:** Additional information about pumpkins' origin location
- **Item Size:** Abbreviations denoting size, e.g. jbo = jumbo, lrg = large
- **Color:** Color of pumpkins
- **Environment:** Additional information about pumpkins' growing environment
- **Unit of Sale:** The unit the customer bought at market, e.g. if they bought pumpkins by the pound, the data should say "PER LB", or if they bought pumpkins by the bin, it would say "PER BIN"
- **Quality:** Additional notes about pumpkin quality as necessary
- **Condition:** Additional notes about pumpkin condition as necessary
- **Appearance:** Additional notes about pumpkin appearance as necessary
- **Storage:** Additional notes about pumpkin storage as necessary
- **Crop:** Additional notes about pumpkin crop as necessary
- **Repack:** Whether the pumpkin has been repackaged before sale
- **Trans Mode:** Mode of transportation used to get pumpkins to terminal market

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the San Francisco terminal market report into a DataFrame
data = pd.read_csv("san-fransisco_9-24-2016_9-30-2017.csv")

In [None]:
data.head()

In [None]:
# Print the percentage of missing values in each column
for col in data.columns:
    pct_missing = np.mean(data[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

In [None]:
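# Check the overall percentage of missing data in the data set

# Total number of cells (rows x columns); np.product was removed in NumPy 2.0
total_cells = np.prod(data.shape)

# Missing values per column, then summed across all columns
missing_cells = pd.isnull(data).sum()
total_missing = missing_cells.sum()

# Multiply by 100 before rounding so the result is a clean whole-number percentage
percentage_missing = round(total_missing / total_cells * 100)

print(percentage_missing, "% of cells are missing from the data")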

# Question 1:

Look at the percentages of missing data per column. Quite a few columns are missing at least some of their values.

**Instructions:**
For each column with missing data, decide what approach to take. Consider the following options:
1. **Drop the Observation** - Remove rows with missing values (best when few rows are affected)
2. **Drop the Feature** - Remove the entire column (best when most/all values are missing or the column isn't needed)
3. **Impute the Missing Values** - Fill in missing values using statistical methods (mean, median, mode)
4. **Replace the Missing Values** - Fill in missing values with a specific value that makes sense for your analysis
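
The four options above can be sketched on a small made-up DataFrame. The values and the choice of column for each strategy are assumptions for illustration only; they are not taken from the USDA report and are not a recommendation for the real columns.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from the USDA report.
toy = pd.DataFrame({
    "Variety": ["PIE TYPE", "HOWDEN TYPE", None, "PIE TYPE"],
    "Low Price": [15.0, np.nan, 20.0, 25.0],
    "Grade": [np.nan, np.nan, np.nan, np.nan],
    "Type": ["Organic", None, None, "Organic"],
})

# 1. Drop the observation: remove rows that have no Variety.
dropped_rows = toy.dropna(subset=["Variety"])

# 2. Drop the feature: remove a column that is entirely empty.
dropped_col = toy.drop(columns=["Grade"])

# 3. Impute: fill missing prices with the median of the observed prices.
imputed = toy.copy()
imputed["Low Price"] = imputed["Low Price"].fillna(imputed["Low Price"].median())

# 4. Replace: fill missing Type with a value chosen from domain knowledge.
replaced = toy.copy()
replaced["Type"] = replaced["Type"].fillna("Conventional")
```

Each strategy works on its own copy here so the results can be compared side by side before committing to one.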

**Your Task:**
In the cell below, write down your decision for each column with missing data and explain your reasoning. Think about:
- How much data is missing?
- Is this column important for answering our research questions?
- What would be the most appropriate way to handle the missing values?

In [None]:
# Answer Question 1 here:




In [None]:
# Repack column: recode the "N" flag as the boolean False
data["Repack"] = data["Repack"].replace({"N": False})
# Check the result with a quick look at the first rows
data.head(3)

# Question 2:

Look at the "Type" column. This column contains only one actual value, "Organic"; every other entry is missing (`NaN`).  

When it comes to food, produce is typically designated as either "Organic" or "Conventional" based on farming practices.  

**For the purposes of this analysis**, we will make the assumption that if a pumpkin is not explicitly labeled as "Organic", it was grown using conventional farming methods.

Do you think this is a reasonable assumption to make? What are the potential risks or limitations of replacing NaN values with "Conventional"? Consider: Could NaN mean something else, like "unknown" or missing data?
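
One way to keep this assumption visible is to count the missing values before filling them and to record which rows were filled. The sketch below uses an invented toy column, and `Type_was_missing` is a hypothetical helper column name, not part of the USDA report.

```python
import pandas as pd

# Toy column, invented for illustration; the real "Type" column may differ.
toy = pd.DataFrame({"Type": ["Organic", None, None, "Organic", None]})

# Count every value, including NaN, before changing anything.
counts_before = toy["Type"].value_counts(dropna=False)

# Remember which rows were originally missing (hypothetical helper column).
toy["Type_was_missing"] = toy["Type"].isnull()

# Apply the "not labeled Organic means Conventional" assumption.
toy["Type"] = toy["Type"].fillna("Conventional")

print(counts_before)
print(toy["Type"].value_counts())
```

Keeping the flag column means the assumption can be revisited later without reloading the raw file.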

In [None]:
# Answer Question 2 here:




In [None]:
# Fill the Missing Values in the Type Column
data["Type"] = data["Type"].fillna("Conventional")

# Question 3:

Based on the information provided by our team, "Grade" is only applied to canned pumpkin. These were all uncanned, whole pumpkins. This column is irrelevant to the dataset.

**Instructions:**
We are going to drop the "Grade" column using the code in the cell below.

**Your Task:**
Before running the code, answer these questions:
- Do you agree with the decision to drop this column? Why or why not?
- What makes a column "irrelevant" or "unnecessary" for a dataset?
- Can you think of other situations where you might want to drop an entire column?
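
As a possible sanity check before dropping a column, you could confirm how much of it is actually empty. This is a minimal sketch on invented toy data; the threshold of "entirely empty" is an assumption, and your own rule may differ.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from the USDA report.
toy = pd.DataFrame({
    "Variety": ["PIE TYPE", "HOWDEN TYPE", "PIE TYPE"],
    "Grade": [np.nan, np.nan, np.nan],
})

# Share of missing values in the column (1.0 means entirely empty).
grade_missing_share = toy["Grade"].isnull().mean()

# Drop only if the column is entirely empty; otherwise keep it for review.
if grade_missing_share == 1.0:
    toy = toy.drop(columns=["Grade"])

print(toy.columns.tolist())
```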

In [None]:
# Answer Question 3 here:


In [None]:
# Drop the Grade column
data = data.drop(columns=["Grade"])

In [None]:
# Check the new number of rows and columns with shape
data.shape

# Question 4:

Now it's your turn! Based on what you've learned, decide which other columns are relevant or irrelevant to our analysis.

**Reminder - Our Research Questions:**
1. Are pumpkins sold at terminal markets in California grown in California?
2. Is the harvest season for pumpkins grown in California consistent year-to-year?
3. Are pumpkin farmers growing specific varieties of pumpkins for specific reasons?

**Your Task:**
- Review the column descriptions at the top of this notebook
- Look at the missing data percentages from earlier
- Decide which columns you need to answer the research questions above
- Consider: Are some columns empty because the information doesn't apply to these pumpkins, or because the data wasn't collected?

**Questions to Answer:**
- Which columns do you think are empty for a reason? What reason?
- Is this intentional (the data doesn't apply) or unintentional (the data should be there but is missing)?
- Which columns should we drop, and why?
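
To help separate columns that are entirely empty from columns that are only partly missing, something like the following could be a starting point. The toy data is invented, and the sketch only lists candidates; deciding whether an empty column is intentional still requires the column descriptions and your own judgment.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from the USDA report.
toy = pd.DataFrame({
    "Variety": ["PIE TYPE", None, "HOWDEN TYPE"],
    "Origin": ["CALIFORNIA", "CALIFORNIA", "CALIFORNIA"],
    "Mostly Low": [np.nan, np.nan, np.nan],
    "Storage": [np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Columns with no values at all: candidates for "does not apply".
empty_cols = toy.columns[toy.isnull().all()].tolist()

# Columns that are only partly missing deserve a closer look before any decision.
partial_cols = missing_share[(missing_share > 0) & (missing_share < 1)].index.tolist()

print(empty_cols)
print(partial_cols)
```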

In [None]:
# Explain your rationale briefly here: 


In [None]:
# Drop the columns you decide are irrelevant, using the same code we used to drop the "Grade" column

# Question 5:

Now let's examine the distribution of sales across different dates. The code below creates two visualizations:
1. A bar chart showing the count of sales transactions per date
2. A histogram showing the distribution of dates

**Instructions:**
- Run both visualization cells below
- Examine the patterns you see in the data

**Questions to consider:**
- What do you notice about the distribution of sales across time?
- Are there any dates with unusually high or low numbers of transactions?
- Does the data appear to be evenly distributed across the year, or are there seasonal patterns?
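
If the dates are stored as text, counts per date will not sort chronologically on their own. The sketch below shows one way to parse and order them; it assumes month/day/year strings such as "10/01/2016", which may not match the real file, and uses invented toy dates.

```python
import pandas as pd

# Toy dates, invented; assumes month/day/year strings like "09/24/2016".
toy = pd.DataFrame({"Date": ["10/01/2016", "09/24/2016", "10/01/2016", "09/30/2017"]})

# Parse text into real datetimes so sorting is chronological, not alphabetical.
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y")

# Count rows per date and order the result by date rather than by count.
per_date = toy["Date"].value_counts().sort_index()

# Count rows per calendar month to make seasonal patterns easier to see.
per_month = toy["Date"].dt.to_period("M").value_counts().sort_index()

print(per_date)
print(per_month)
```

Calling `per_date.plot.bar()` or `per_month.plot.bar()` would then draw the bars in time order.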

In [None]:
# Answer Question 5 here:



In [None]:
# Bar chart - distribution of sales transactions per date
# This shows how many sales occurred on each date
data['Date'].value_counts().plot.bar()

In [None]:
# Histogram showing the distribution of dates
data['Date'].hist(bins=100)
plt.xticks(rotation=90)

# Question 6:

Now let's think about data consistency. Inconsistent data often comes from inconsistent formatting - for example, dates written in different formats (10/01/2016 vs. October 1, 2016), or categories with different capitalization (ORGANIC vs. Organic vs. organic).

**Instructions:**
- Review the dataset columns, especially those with text/categorical data
- Look at the sample data shown in the `data.head()` output from earlier
- Think about the columns that might have formatting inconsistencies

**Your Task:**
Answer these questions:
- Are there any columns in this dataset where you're concerned about inconsistencies?
- Which columns would you want to check for formatting issues?
- What specific inconsistencies might you look for? (Examples: capitalization, spacing, abbreviations vs. full words, date formats)
- How would you check for these inconsistencies? (Hint: methods like `value_counts()`, `unique()`, or `str` methods might help)
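
As one possible check, you could compare the number of distinct values in a text column before and after normalizing case and whitespace. The toy values below are invented to mimic the capitalization example above; the real columns may or may not contain such variants.

```python
import pandas as pd

# Toy column, invented for illustration; not taken from the USDA report.
toy = pd.DataFrame({"Type": ["ORGANIC", "Organic", "organic ", "Conventional"]})

# Raw distinct values: the three spellings of "organic" show up separately.
raw_values = toy["Type"].unique().tolist()

# Normalize whitespace and case, then compare how many distinct values remain.
normalized = toy["Type"].str.strip().str.lower()
n_raw = toy["Type"].nunique()
n_normalized = normalized.nunique()

# A drop in the distinct count signals formatting inconsistencies.
print(n_raw, n_normalized)
print(normalized.value_counts())
```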

In [None]:
# Answer Question 6 here: 

