# Notebook 1: Green Data Python Analysis – Foundations
### Welcome to Green Data!

### Topics covered: Python Setup, Data Cleaning, Data Wrangling, Data Visualization (basics)


## Getting Started with Python

#### Importing Packages

Packages are collections of reusable functions that anyone can import and use in Python. These are the main ones we will be using:

In [None]:
import numpy as np  #for mathematical operations

import scipy as sc #for more advanced mathematical operations

import matplotlib.pyplot as plt  #for plotting data

import pandas as pd  #for data analysis

#### Importing Datasets

First, download the example dataset from our Green Data folder here: https://drive.google.com/drive/u/1/folders/1K5UGTtDnP4tCr5Y3hJNCDsuPKoV2JZFR

Save it somewhere whose path you can easily find.

Define an object called 'path' with the path where you saved your dataset. It should look something like this:
"/Users/panchalichoudhary/Desktop/GD_exampledataset.csv"

Make sure you don't forget the quotation marks.

In [1]:
# Define the path to your dataset (replace the example below with your own path)
path = "/Users/panchalichoudhary/Desktop/GD_exampledataset.csv"

Now, we will use the pandas function pd.read_csv() to import our dataset. Define an object called 'df', plugging your path into the function.

In [None]:
# Read in your dataset with pandas
# df = pd.read_csv(path)
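
As a minimal sketch of what `pd.read_csv()` does, the example below reads a tiny CSV from an in-memory string instead of a file, so it runs without the download. The column names and values are made up for illustration and are not taken from the Green Data dataset; only `Waste_Generation (kg/month)` follows a column name used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for GD_exampledataset.csv
csv_text = """Building_Type,Energy_Usage (kWh/month),Waste_Generation (kg/month)
Office,1200,300
School,800,150
Hospital,2500,600
"""

# pd.read_csv accepts a file path or any file-like object
example_df = pd.read_csv(io.StringIO(csv_text))

# Show the first two rows
print(example_df.head(2))
```

With a real file, you would pass your `path` string in place of the `io.StringIO(...)` object.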


Take a first look at your dataset.

In [2]:
# Display the first few rows
# df.head()

# Or view the entire dataset
# print(df)  # or simply: df

## Introduction to Data Analysis 
#### Basic Functions

To check data types present in a dataset:

In [None]:
#use .dtypes

For a brief statistical summary:

In [None]:
#use .describe

For a full statistical summary (all columns, including non-numerical):

In [None]:
#modify .describe (hint: use the include argument)

For a concise summary of your dataset:

In [None]:
#use .info
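
The four summaries above could look roughly like this on a small stand-in table. This is a sketch under assumptions: the `toy` DataFrame and its column names (other than `Waste_Generation (kg/month)`) are hypothetical, so swap in your own `df` when working with the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Green Data dataset
toy = pd.DataFrame({
    "Building_Type": ["Office", "School", "Hospital"],
    "Energy_Usage (kWh/month)": [1200.0, 800.0, 2500.0],
    "Waste_Generation (kg/month)": [300, 150, 600],
})

# Data type of each column
print(toy.dtypes)

# Summary statistics for the numeric columns only
numeric_summary = toy.describe()
print(numeric_summary)

# Summary for every column, including non-numeric ones
full_summary = toy.describe(include="all")
print(full_summary)

# Column names, non-null counts, dtypes, and memory usage
toy.info()
```

Note that `.dtypes` is an attribute (no parentheses), while `.describe()` and `.info()` are methods.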

## Cleaning Data

Datasets are rarely perfect: they often contain missing values or duplicate rows. It's important to remove or replace these values if we want our data analysis to make sense. Missing values show up as 'NaN' in Python, and we use np.nan in code to refer to them.

In [None]:
#check missing values using .isnull().sum()

In [None]:
#replace missing value with whatever you want using .replace


In [None]:
#or replace the missing value with the mean of that column. hint: use .mean


In [3]:
#to drop duplicates, use .drop_duplicates()
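
One way the cleaning steps above might fit together is sketched below. The `toy` table, its column names, and the choice of 0 as a replacement value are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one duplicated row
toy = pd.DataFrame({
    "Building_Type": ["Office", "School", "School", "Hospital"],
    "Energy_Usage (kWh/month)": [1200.0, 800.0, 800.0, np.nan],
})
col = "Energy_Usage (kWh/month)"

# Count missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Option 1: replace every NaN with a fixed value (0 here)
filled_zero = toy.replace(np.nan, 0)

# Option 2: replace NaN in one column with that column's mean
# (.mean() skips missing values by default)
filled_mean = toy.copy()
filled_mean[col] = filled_mean[col].replace(np.nan, filled_mean[col].mean())

# Drop rows that are exact duplicates of an earlier row
deduped = toy.drop_duplicates()
print(deduped)
```

Each step returns a new object rather than changing `toy` in place, so assign the result back (for example `df = df.drop_duplicates()`) if you want to keep it.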

# Data Wrangling

## Data Formatting

You can apply calculations to entire columns. Let's say you want to convert Water Usage from gallons to liters:

In [None]:
#convert gallons to liters using conversion factor of 1 gal = 3.78541 L

In [None]:
#rename column using .rename

You can also convert a column's data type:

In [None]:
#use .astype
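
The three formatting steps above can be sketched as follows. The column name `Water_Usage (gal/month)` and the sample values are hypothetical; check the actual column name in your dataset before adapting this.

```python
import pandas as pd

# Hypothetical water usage values in gallons
toy = pd.DataFrame({"Water_Usage (gal/month)": [1000.0, 2500.0, 400.0]})

GAL_TO_L = 3.78541  # conversion factor given above

# Arithmetic on a column is applied to every row at once
toy["Water_Usage (gal/month)"] = toy["Water_Usage (gal/month)"] * GAL_TO_L

# Rename the column so the label matches the new unit
toy = toy.rename(columns={"Water_Usage (gal/month)": "Water_Usage (L/month)"})

# Convert the column from float to integer (this drops the decimal part)
toy["Water_Usage (L/month)"] = toy["Water_Usage (L/month)"].astype(int)
print(toy)
```

Renaming right after the conversion keeps the unit in the column label from going stale.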

## Data Normalization

Sometimes we want to normalize data so that all values fall within the same range. In our dataset, for example, Energy Usage and Waste Generation span different ranges. If we analyze these two variables together, Energy Usage can influence the results more simply because its numbers are larger. To avoid this bias, we can normalize our data:

#### Simple Feature Scaling

In [None]:
# Divide every value by the column maximum so the largest value becomes 1
waste = df["Waste_Generation (kg/month)"]
df["Waste_Generation (kg/month) [Normalized]"] = waste / waste.max()
#and we can do the same thing with Energy Usage

#### Min-Max

In [None]:
# Subtract the minimum, then divide by the range (max - min)
waste = df["Waste_Generation (kg/month)"]
df["Waste_Generation (kg/month) [Normalized]"] = (waste - waste.min()) / (waste.max() - waste.min())

#### Z-Score

In [None]:
# Store the result in a new column so the raw values are not overwritten
waste = df["Waste_Generation (kg/month)"]
df["Waste_Generation (kg/month) [Normalized]"] = (waste - waste.mean()) / waste.std()
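
To see how the three methods differ, the sketch below applies each of them to the same three made-up values (they are not from the Green Data dataset) and prints the results side by side.

```python
import pandas as pd

# Hypothetical waste values in kg/month
waste = pd.Series([100.0, 200.0, 400.0], name="Waste_Generation (kg/month)")

# Simple feature scaling: the largest value becomes 1
simple = waste / waste.max()

# Min-max: the smallest value becomes 0 and the largest becomes 1
min_max = (waste - waste.min()) / (waste.max() - waste.min())

# Z-score: mean 0 and standard deviation 1
# (pandas .std() uses the sample standard deviation by default)
z_score = (waste - waste.mean()) / waste.std()

comparison = pd.DataFrame(
    {"raw": waste, "simple": simple, "min_max": min_max, "z_score": z_score}
)
print(comparison)
```

Simple scaling and min-max keep values between 0 and 1 for non-negative data, while z-scores can be negative and are not bounded.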

## Binning

Sometimes we want to convert continuous data into discrete data by sorting the values into a small number of groups (bins).

In [None]:
#define numberofbins:

#use np.linspace to set up bins:

In [None]:
#set bin names

In [None]:
#add bin columns to df using pd.cut
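
A minimal sketch of the binning steps above, assuming three equal-width bins and the hypothetical labels "Low", "Medium", and "High". The column name and values are also made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical energy usage values
toy = pd.DataFrame({"Energy_Usage (kWh/month)": [200.0, 900.0, 1500.0, 2600.0, 3000.0]})
energy = toy["Energy_Usage (kWh/month)"]

# Define the number of bins
number_of_bins = 3

# N bins need N + 1 evenly spaced edges between the minimum and maximum
bin_edges = np.linspace(energy.min(), energy.max(), number_of_bins + 1)

# One label per bin
bin_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin
toy["Energy_Usage_Binned"] = pd.cut(
    energy, bins=bin_edges, labels=bin_names, include_lowest=True
)
print(toy)
```

Without `include_lowest=True`, the smallest value would fall outside the first bin and come back as NaN, because `pd.cut` intervals are open on the left by default.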

## Converting Categorical Variables into Quantitative Variables

Sometimes we need our categorical variables to have numeric values so that we can include them in data analysis. Let's say we wanted to see the effect of the type of building on energy usage (which should be a mathematical relationship). Type of building is a categorical variable, so we give it numeric values by converting it into quantitative variables.

In [None]:
#assign a dummy (0/1) variable to each category using pd.get_dummies

In [None]:
#save a new dataframe with the new columns using .join
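
The two steps above can be sketched like this. The `Building_Type` column name and its categories are assumptions; the real dataset may label them differently.

```python
import pandas as pd

# Hypothetical data with one categorical column
toy = pd.DataFrame({
    "Building_Type": ["Office", "School", "Office"],
    "Energy_Usage (kWh/month)": [1200, 800, 1100],
})

# One new 0/1 column per category; dtype=int gives integers instead of booleans
dummies = pd.get_dummies(toy["Building_Type"], dtype=int)

# Attach the dummy columns to the original table, matching rows by index
toy_with_dummies = toy.join(dummies)
print(toy_with_dummies)
```

Each row has a 1 in exactly one dummy column, the one matching its building type.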

# Data Visualization (Basics)

Data visualizations help us see distributions and relationships in a dataset. 

### Histogram
Make a histogram of energy usage.

In [None]:
# use .hist()
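
A possible version of the histogram, shown on made-up values. The column name and the choice of four bins are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical energy usage values
toy = pd.DataFrame(
    {"Energy_Usage (kWh/month)": [200, 900, 950, 1500, 1600, 1650, 2600, 3000]}
)

fig, ax = plt.subplots()

# .hist() counts how many values fall into each of the 4 equal-width bins
toy["Energy_Usage (kWh/month)"].hist(bins=4, ax=ax)

ax.set_xlabel("Energy Usage (kWh/month)")
ax.set_ylabel("Number of buildings")
ax.set_title("Distribution of Energy Usage")
```

In a notebook the figure appears below the cell; in a plain script, add `plt.show()` at the end.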

### Scatterplot
Create a scatterplot of Energy Usage vs. Waste Generation.

In [None]:
# use plt.scatter()
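
A sketch of the scatterplot on made-up values. The energy column name is hypothetical, and putting Waste Generation on the x-axis and Energy Usage on the y-axis is one reading of "Energy Usage vs. Waste Generation"; swap the axes if you prefer the other orientation.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical paired measurements
toy = pd.DataFrame({
    "Energy_Usage (kWh/month)": [800, 1200, 1900, 2500],
    "Waste_Generation (kg/month)": [150, 300, 420, 600],
})

# Each row of the table becomes one point
points = plt.scatter(
    toy["Waste_Generation (kg/month)"], toy["Energy_Usage (kWh/month)"]
)

plt.xlabel("Waste Generation (kg/month)")
plt.ylabel("Energy Usage (kWh/month)")
plt.title("Energy Usage vs. Waste Generation")
```

Labeling both axes with units makes the plot readable without referring back to the table.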

### Boxplot
Box plots summarize the distribution of a variable: the median, the quartiles, and any outliers.

Make a boxplot of Waste Generation grouped by Building Type.

In [None]:
# use sns.boxplot (first run: import seaborn as sns)
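
The notebook suggests `sns.boxplot`, which needs seaborn (not among the packages imported at the top). As a hedged alternative that only relies on pandas and matplotlib, the sketch below uses pandas' own `DataFrame.boxplot` with the `by` argument. The data and the `Building_Type` column name are hypothetical. With seaborn installed, a comparable call would be along the lines of `sns.boxplot(data=toy, x="Building_Type", y="Waste_Generation (kg/month)")`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical waste values for two building types
toy = pd.DataFrame({
    "Building_Type": ["Office", "Office", "Office", "Hospital", "Hospital", "Hospital"],
    "Waste_Generation (kg/month)": [250, 300, 340, 520, 600, 710],
})

fig, ax = plt.subplots()

# One box per building type, drawn on the axes created above
toy.boxplot(column="Waste_Generation (kg/month)", by="Building_Type", ax=ax)

ax.set_xlabel("Building Type")
ax.set_ylabel("Waste Generation (kg/month)")
ax.set_title("Waste Generation by Building Type")
```

The line inside each box marks the median, and the box spans the first to third quartile.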