# Notebook 1: Green Data Python Analysis – Foundations
### Welcome to Green Data!

### Topics covered: Python Setup, Data Cleaning, Data Wrangling, Data Visualization (basics), Time Series (basics)


## Getting Started with Python

#### Importing Packages

Packages contain useful functions that anyone can import and use in Python. These are the main ones we will be using:

In [1]:
import numpy as np  #for mathematical operations

import scipy as sc #for more advanced mathematical operations

import matplotlib.pyplot as plt  #for plotting data

import pandas as pd  #for data analysis

#### Importing Datasets

First, download the example dataset from our Green Data folder here: https://drive.google.com/drive/u/1/folders/1K5UGTtDnP4tCr5Y3hJNCDsuPKoV2JZFR

Save it somewhere whose file path you can easily find.

Define an object called 'path' with the path where you saved your dataset. It should look something like this:
"/Users/panchalichoudhary/Desktop/GD_exampledataset.csv"

Make sure you don't forget the quotation marks.

In [2]:
# Define the path to your dataset
path = 'GD_exampledataset.csv'

Now, we will use the pandas function pd.read_csv() to import our dataset. Define an object called 'df', plugging your path into the function.

In [3]:
# Read in your dataset with pandas
df = pd.read_csv(path)

Take a first look at your dataset.

In [31]:
# Display the first few rows
df.head()

# Or view the entire dataset
df

| | Building_Name | Energy_Usage (kWh) | Water_Usage (gal) | GHG_Emissions (kg) | Waste_Generation (kg/month) | Type_of_Building | Waste_Generation (kg/month) [Normalized] |
|---|---|---|---|---|---|---|---|
| 0 | Eco Tower | 45000 | 28000.0 | 6000 | -0.771319 | Residential | 0.264706 |
| 1 | Green Office | 32000 | 18000.0 | 4800 | -1.557615 | Comercial | 0.058824 |
| 2 | Solaris Mall | 75000 | 42000.0 | 10500 | 2.036881 | Comercial | 1.0 |
| 3 | Sustainable HQ | 56000 | 32000.0 | 7000 | -0.209679 | Comercial | 0.411765 |
| 4 | Eco Haven | 39000 | NaN | 5800 | -0.995975 | School | 0.205882 |
| 5 | Eco Dynamics | 42000 | 24000.0 | 6500 | -0.546663 | Residential | 0.323529 |
| 6 | Green Plaza | 30000 | 17000.0 | 4500 | -1.78227 | Comercial | 0.0 |
| 7 | Solar Heights | 68000 | 38000.0 | 9500 | 1.475241 | Residential | 0.852941 |
| 8 | Terra Green | 59000 | 33000.0 | 7200 | 0.014977 | School | 0.470588 |
| 9 | BioTech Tower | 48000 | 27000.0 | 6000 | -0.771319 | School | 0.264706 |


## Introduction to Data Analysis 
#### Basic Functions

To check data types present in a dataset:

In [5]:
#use .dtypes
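
One possible way to fill in this cell is sketched below. The `sample` DataFrame is a small stand-in built from three rows of the dataset shown earlier, so that the snippet runs on its own; in the notebook you would call the same attribute on `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in for df, using three rows from the example dataset
sample = pd.DataFrame({
    "Building_Name": ["Eco Tower", "Green Office", "Eco Haven"],
    "Energy_Usage (kWh)": [45000, 32000, 39000],
    "Water_Usage (gal)": [28000.0, 18000.0, np.nan],
})

# .dtypes is an attribute (no parentheses): it lists one data type per column
print(sample.dtypes)
```

Whole numbers show up as an integer type, columns containing NaN as a float type, and text columns as an object or string type depending on your pandas version.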

For a brief statistical summary:

In [32]:
# .describe() is a method, so it needs parentheses
df.describe()

For a full statistical summary (all columns, including non-numerical):

In [33]:
# include="all" adds the non-numerical columns to the summary
df.describe(include="all")

For a concise summary of your dataset:

In [34]:
# .info() is a method, so it needs parentheses
df.info()
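
The summary functions in this section are methods, so each one must be called with parentheses; without them, Python only prints a reference to the method. A minimal sketch on a stand-in DataFrame (three rows taken from the example dataset, named `sample` here for illustration):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Building_Name": ["Eco Tower", "Green Office", "Eco Haven"],
    "Energy_Usage (kWh)": [45000, 32000, 39000],
    "Water_Usage (gal)": [28000.0, 18000.0, np.nan],
})

# Brief summary: numeric columns only (count, mean, std, min, quartiles, max)
brief = sample.describe()

# Full summary: also covers text columns (unique, top, freq)
full = sample.describe(include="all")

# Concise overview: column names, non-null counts, and dtypes (prints directly)
sample.info()

print(brief)
print(full)
```

Note that `count` excludes missing values, so a count lower than the number of rows is a quick sign that a column contains NaN.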

## Cleaning Data

Datasets are rarely perfect: they often contain missing values or duplicate rows. We need to handle these (by removing, replacing, or filling them) before our analysis can give meaningful results. Missing values show up as 'NaN' in Python, and we use np.nan in code to refer to them. To count the missing values in each column:

In [35]:
# .isnull() is a method, so it needs parentheses before .sum()
df.isnull().sum()

In [10]:
#replace missing value with whatever you want using .replace


In [11]:
#or replace the missing value with the mean of that column. hint: use .mean
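
A minimal sketch of both replacement strategies, using a stand-in DataFrame with the one missing Water Usage value from the example dataset. The fill value `0` in Option A is only a placeholder; choose whatever makes sense for your analysis.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Building_Name": ["Eco Tower", "Green Office", "Eco Haven"],
    "Water_Usage (gal)": [28000.0, 18000.0, np.nan],
})

# Option A: replace NaN with a fixed value (0 is an arbitrary placeholder)
filled_fixed = sample.copy()
filled_fixed["Water_Usage (gal)"] = sample["Water_Usage (gal)"].replace(np.nan, 0)

# Option B: replace NaN with the column mean (.mean() skips NaN by default)
water_mean = sample["Water_Usage (gal)"].mean()
filled_mean = sample.copy()
filled_mean["Water_Usage (gal)"] = sample["Water_Usage (gal)"].replace(np.nan, water_mean)

print(filled_mean)
```

Filling with the mean keeps the column average unchanged, whereas filling with 0 would pull it down.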


In [12]:
#to drop duplicates, use .drop_duplicates()
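
As an illustration, the sketch below removes a repeated row from a stand-in DataFrame. The duplicated "Eco Tower" row is made up for this example; it is not claimed to exist in the real dataset.

```python
import pandas as pd

# The third row repeats the first one (hypothetical duplicate for illustration)
sample = pd.DataFrame({
    "Building_Name": ["Eco Tower", "Green Office", "Eco Tower"],
    "Energy_Usage (kWh)": [45000, 32000, 45000],
})

# Count fully duplicated rows before removing them
n_duplicates = sample.duplicated().sum()

# Keep the first occurrence of each row, then renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

`drop_duplicates()` returns a new DataFrame, so remember to assign the result (for example `df = df.drop_duplicates()`).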

# Data Wrangling

## Data Formatting

You can apply calculations to entire columns. Let's say you want to convert Water Usage from gallons to liters:

In [13]:
#convert gallons to liters using conversion factor of 1 gal = 3.78541 L

In [14]:
#rename column using .rename
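
One way the conversion and the rename could look, sketched on a stand-in column with three Water Usage values from the example dataset. The new column name `Water_Usage (L)` and the constant name `GAL_TO_L` are choices made here for illustration.

```python
import pandas as pd

GAL_TO_L = 3.78541  # liters per gallon, as given above

sample = pd.DataFrame({"Water_Usage (gal)": [28000.0, 18000.0, 42000.0]})

# Arithmetic on a column is applied to every row at once
sample["Water_Usage (gal)"] = sample["Water_Usage (gal)"] * GAL_TO_L

# Rename the column so its label matches the new unit
sample = sample.rename(columns={"Water_Usage (gal)": "Water_Usage (L)"})

print(sample)
```

Renaming right after converting avoids a column whose label says gallons while its values are in liters.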

You can also convert an object's data type:

In [15]:
#use .astype
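
A short sketch of a type conversion, here turning an integer column into floats. Which column and target type you pick is up to you; this choice is only an example.

```python
import pandas as pd

sample = pd.DataFrame({"Energy_Usage (kWh)": [45000, 32000, 75000]})

# .astype returns a converted copy, so assign it back to the column
sample["Energy_Usage (kWh)"] = sample["Energy_Usage (kWh)"].astype("float64")

print(sample.dtypes)
```

Going the other way (float to integer) fails if the column still contains NaN, so handle missing values first.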

## Data Normalization

Sometimes we want to normalize data so that all variables fall within a comparable range. In our dataset, for example, Energy Usage and Waste Generation are on very different scales. If we analyze these two variables together, Energy Usage can dominate the results simply because its numbers are larger. To avoid this scale bias, we can normalize our data:

#### Simple Feature Scaling

In [16]:
df["Waste_Generation (kg/month) [Normalized]"] = df["Waste_Generation (kg/month)"]/df["Waste_Generation (kg/month)"].max()
#and we can do the same thing with Energy Usage

#### Min-Max

In [17]:
df["Waste_Generation (kg/month) [Normalized]"] = (df["Waste_Generation (kg/month)"]-df["Waste_Generation (kg/month)"].min())/(df["Waste_Generation (kg/month)"].max()-df["Waste_Generation (kg/month)"].min())

#### Z-Score

In [18]:
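#note: this overwrites the original column with its z-scores; assign to a new column name if you want to keep the raw values
df["Waste_Generation (kg/month)"] = (df["Waste_Generation (kg/month)"]-df["Waste_Generation (kg/month)"].mean())/df["Waste_Generation (kg/month)"].std()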
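
To compare the three methods side by side, the sketch below applies each of them to five Energy Usage values from the example dataset. Each result is stored in its own variable so the raw values are never overwritten.

```python
import pandas as pd

energy = pd.Series([45000, 32000, 75000, 56000, 39000], name="Energy_Usage (kWh)")

# Simple feature scaling: divide by the maximum, so the largest value becomes 1
simple = energy / energy.max()

# Min-max: rescale so the smallest value is 0 and the largest is 1
min_max = (energy - energy.min()) / (energy.max() - energy.min())

# Z-score: center on the mean and divide by the standard deviation
# (pandas .std() uses the sample standard deviation, ddof=1)
z_score = (energy - energy.mean()) / energy.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

Simple scaling and min-max keep values in a fixed range, while z-scores are centered on 0 and can be negative.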

## Binning

Sometimes we want to convert continuous data into discrete data by sorting the values into a small number of groups, or "bins".

In [19]:
#define numberofbins:

#use np.linspace to set up bins:

In [20]:
#set bin names

In [21]:
#add bin columns to df using pd.cut
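
The three binning steps could be sketched as follows, using the ten Energy Usage values shown earlier. The choice of three bins and the names "Low", "Medium", and "High" are assumptions made for this example.

```python
import numpy as np
import pandas as pd

energy = pd.Series([45000, 32000, 75000, 56000, 39000,
                    42000, 30000, 68000, 59000, 48000])

# Step 1: choose the number of bins; N bins need N + 1 evenly spaced edges
number_of_bins = 3
bins = np.linspace(energy.min(), energy.max(), number_of_bins + 1)

# Step 2: one name per bin (hypothetical labels)
bin_names = ["Low", "Medium", "High"]

# Step 3: assign each value to a bin; include_lowest=True keeps the minimum
# value inside the first bin instead of turning it into NaN
binned = pd.cut(energy, bins, labels=bin_names, include_lowest=True)

print(binned.value_counts())
```

In the notebook, the result would be stored as a new column, for example `df["Energy_Usage_Binned"] = pd.cut(...)`, where the column name is again just a suggestion.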

## Converting categorical variables into quantitative variables

Sometimes we need our categorical variables to have numeric values so that we can conduct data analysis. Let's say we wanted to see the effect of the type of building on energy usage (which should be a mathematical relationship). Type of building is a categorical variable, so we give it numeric values by converting it into a quantitative variable.

In [22]:
#assigns a dummy variable to each categorical variable using pd.get_dummies

In [23]:
#save a new dataframe with the new columns using .join
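
A minimal sketch of the two steps on a stand-in DataFrame with three rows from the example dataset. Category names are kept exactly as they appear in the data, including the spelling "Comercial".

```python
import pandas as pd

sample = pd.DataFrame({
    "Building_Name": ["Eco Tower", "Green Office", "Eco Haven"],
    "Type_of_Building": ["Residential", "Comercial", "School"],
})

# One indicator column per category: 1/True where the row belongs to it
dummies = pd.get_dummies(sample["Type_of_Building"])

# Attach the indicator columns to the original data (rows align on the index)
sample_with_dummies = sample.join(dummies)

print(sample_with_dummies)
```

Depending on your pandas version, the indicator columns hold True/False or 1/0; both behave as numbers in calculations.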

# Data Visualization (Basics)

Data visualizations help us see distributions and relationships in a dataset. 

### Histogram
Make a histogram of energy usage.

In [24]:
# use .hist()
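
One way to draw the histogram, sketched with the ten Energy Usage values shown earlier. The number of bins and the axis labels are choices made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

energy = pd.Series([45000, 32000, 75000, 56000, 39000,
                    42000, 30000, 68000, 59000, 48000],
                   name="Energy_Usage (kWh)")

# .hist() draws the histogram and returns the matplotlib Axes it used
ax = energy.hist(bins=5)
ax.set_xlabel("Energy usage (kWh)")
ax.set_ylabel("Number of buildings")
ax.set_title("Distribution of energy usage")
# In a notebook the figure renders when the cell finishes;
# in a plain script, call plt.show() to display it.
```

In the notebook, the equivalent call would be `df["Energy_Usage (kWh)"].hist()`.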

### Scatterplot
Create a scatterplot of Energy Usage vs. Waste Generation.

In [25]:
# use plt.scatter()
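
A possible version of this cell, sketched with the first five rows of the dataset as displayed earlier. Note that the Waste Generation values in that display are z-scores, so the y-axis here is in standardized units rather than kg/month.

```python
import matplotlib.pyplot as plt
import pandas as pd

sample = pd.DataFrame({
    "Energy_Usage (kWh)": [45000, 32000, 75000, 56000, 39000],
    # z-scored values, copied from the table displayed earlier
    "Waste_Generation (kg/month)": [-0.771319, -1.557615, 2.036881, -0.209679, -0.995975],
})

# First argument is the x-axis, second is the y-axis
points = plt.scatter(sample["Energy_Usage (kWh)"], sample["Waste_Generation (kg/month)"])
plt.xlabel("Energy usage (kWh)")
plt.ylabel("Waste generation (z-score)")
plt.title("Energy usage vs. waste generation")
# In a plain script, call plt.show() to display the figure.
```

Each dot is one building, so a cloud that rises from left to right suggests that buildings using more energy also generate more waste.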

### Boxplot
Box plots summarize the distribution of a variable at a glance: the median, the quartiles, and any outliers.

Make a boxplot of Waste Generation grouped by Building Type.

In [26]:
# use sns.boxplot (seaborn is not imported above, so first run: import seaborn as sns)
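
The sketch below builds the same kind of grouped boxplot with plain matplotlib, as an alternative that does not require seaborn. It uses the ten rows displayed earlier, where Waste Generation appears as z-scores. With seaborn installed, `sns.boxplot(data=df, x="Type_of_Building", y="Waste_Generation (kg/month)")` produces a comparable chart in one line.

```python
import matplotlib.pyplot as plt
import pandas as pd

sample = pd.DataFrame({
    "Type_of_Building": ["Residential", "Comercial", "Comercial", "Comercial", "School",
                         "Residential", "Comercial", "Residential", "School", "School"],
    # z-scored values, copied from the table displayed earlier
    "Waste_Generation (kg/month)": [-0.771319, -1.557615, 2.036881, -0.209679, -0.995975,
                                    -0.546663, -1.78227, 1.475241, 0.014977, -0.771319],
})

# Split the waste values into one array per building type
grouped = sample.groupby("Type_of_Building")["Waste_Generation (kg/month)"]
labels = [name for name, _ in grouped]
data = [values.to_numpy() for _, values in grouped]

# One box per group; boxes are drawn at x positions 1, 2, 3, ...
fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set_xlabel("Type of building")
ax.set_ylabel("Waste generation (z-score)")
```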

## Basics of Time Series Analysis
Time series analysis involves analyzing time-ordered data to identify trends, seasonal patterns, and other characteristics to extract insights and make future predictions.

First, let's practice importing a dataset again. Download the Mexico City Air Quality Data from our training resources folder and load it into the notebook.

Now let's explore the dataset using some of the functions we learned before.

In [27]:
#hint: use .head(), .info(), and .describe()


Look for what each column represents. Which one is time? Which are pollutants? Also, are all the columns numeric?

#### Converting Data to Datetime Form

Before we can do any time series analysis, we need to clean the data and convert the timestamp into datetime form.

In [28]:
## Convert the timestamp column to datetime
#df['Timestamp'] = pd.to_datetime(df['Timestamp'], dayfirst=True)

# Set as index
#df = df.set_index('Timestamp')

# Check the new index
#df.index

Notice the day/month order: dayfirst=True tells pandas to read the day before the month, so a value like 01/02 is parsed as 1 February rather than 2 January.
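
To see the effect of `dayfirst=True` in isolation, here is a small sketch on made-up data. The timestamps and ozone readings are hypothetical and only mimic a day-first format; the column names `Timestamp` and `Ozone [ppb]` follow the cells in this notebook.

```python
import pandas as pd

# Hypothetical readings with day-first timestamps (DD/MM/YYYY HH:MM)
raw = pd.DataFrame({
    "Timestamp": ["01/02/2021 00:00", "02/02/2021 00:00", "03/02/2021 00:00"],
    "Ozone [ppb]": [20.0, 25.0, 22.0],
})

# Parse the strings as dates, reading the day before the month
raw["Timestamp"] = pd.to_datetime(raw["Timestamp"], dayfirst=True)

# Use the timestamps as the index so pandas treats the data as a time series
ts = raw.set_index("Timestamp")

print(ts.index)

# For comparison: without dayfirst, the same text is read month-first
month_first = pd.to_datetime("01/02/2021")
print(month_first)
```

With `dayfirst=True` all three rows fall in February; without it, the first value would be read as 2 January.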

#### Handle Missing Data
We also need to check for missing values and decide how to handle them. Try either of the two options below.

In [29]:
# Check for missing values
#df.isna().sum()

# Option 1: Drop missing rows
#df_clean = df.dropna()

# Option 2: Fill missing values using interpolation
#df_clean = df.interpolate(method='time')

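Note: Interpolation is often the better choice for time series because it estimates each missing value from the observations around it, which keeps the time index continuous instead of leaving gaps where rows were removed. Using method='time' requires the index to be in datetime form.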
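
The difference between the two options can be seen on a tiny made-up series. The timestamps and values below are hypothetical: there is one missing reading, followed by a two-hour gap before the next observation.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings: the 01:00 value is missing, 02:00 was never recorded
index = pd.to_datetime(["2021-02-01 00:00", "2021-02-01 01:00", "2021-02-01 03:00"])
ozone = pd.Series([20.0, np.nan, 26.0], index=index, name="Ozone [ppb]")

# Option 1: drop the missing row (the series gets shorter)
dropped = ozone.dropna()

# Option 2: estimate the missing value, weighting by elapsed time
filled = ozone.interpolate(method="time")

print(dropped)
print(filled)
```

The missing reading sits one hour into a three-hour span, so time-based interpolation places it one third of the way from 20 to 26, at 22.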

#### Visualizing Time Series

We want to visualize the air pollutant levels provided by the dataset, so let's plot pollutant trends over time.

In [30]:
#df_clean[['PM2.5 [ug/m3]', 'PM10[ug/m3]', 'Ozone [ppb]']].plot(figsize=(10,5))
#plt.title("Air Pollutant Levels Over Time")
#plt.xlabel("Time")
#plt.ylabel("Concentration")
#plt.show()

Hint: Start with a small sample (e.g. one month) if your plot of all the data looks crowded.
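
Once the index is in datetime form, one month can be selected with a partial date string. The sketch below uses a made-up daily series; the dates and values are hypothetical, and only the column name `PM2.5 [ug/m3]` comes from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering January through March 2021
index = pd.date_range("2021-01-01", "2021-03-31", freq="D")
pm25 = pd.Series(np.arange(len(index), dtype=float), index=index, name="PM2.5 [ug/m3]")

# A "YYYY-MM" string selects every row in that month
one_month = pm25.loc["2021-02"]

# Plot only the selected month
ax = one_month.plot(figsize=(10, 5))
ax.set_title("PM2.5 in February 2021 (made-up data)")

print(len(one_month))
```

The same pattern works on a DataFrame with a datetime index, for example `df_clean.loc["2021-02"]`, with the month replaced by one that exists in your data.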