## Milestone Project / The Data Incubator Bootcamp

### Analysis of the Value of Energy Cost Savings Program Savings for Businesses in New York City


Import the Python packages:

In [None]:
import pandas as pd
import numpy as np
import datetime as datetime
import matplotlib as mp
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from datetime import date

Read data from a CSV file to create a DataFrame:

In [None]:
df = pd.read_csv("datasets/Value_of_Energy_Cost_Savings_Program_Savings_for_Businesses_-_FY2020.csv")

General information about the DataFrame:

In [None]:
df.info()

How many different companies are represented in the data set?

In [None]:
print("Number of unique companies:", df['Company Name'].nunique())

What is the total number of jobs created for businesses in Queens?

In [None]:
job_created = df.dropna(subset=["Job created"])
queens = job_created.groupby("Borough")
print("Jobs created in Queens:")
print(queens.get_group("Queens")["Job created"].sum())

How many unique email domain names are there in the data set?

In [None]:
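# Keep only rows that have an email address, then take the part after "@".
email = df[["company email"]].dropna()
email["domain"] = email["company email"].str.split('@').str[1]
# Lower-case the domains so that differences in capitalization are not counted twice.
print("Number of unique email domains:", email["domain"].str.lower().nunique())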
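
As a quick sanity check of the domain extraction, here is a minimal sketch on a toy Series. The addresses are made up for illustration and are not taken from the data set; only the column name `company email` comes from the notebook.

```python
import pandas as pd

# Toy data (hypothetical addresses, for illustration only).
emails = pd.Series(
    ["info@Example.com", "sales@example.com", "owner@shop.nyc", None],
    name="company email",
)

# Drop missing addresses, keep the part after "@", and normalize case
# so that "Example.com" and "example.com" count as one domain.
domains = emails.dropna().str.split("@").str[1].str.lower()

n_domains = domains.nunique()
print("Number of unique email domains:", n_domains)
```

Without the `.str.lower()` step the two `example.com` spellings would be counted as separate domains.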

Considering only NTAs with at least 5 listed businesses, what is the average total savings and the total jobs created for each NTA?

In [None]:
# Keep only the rows whose NTA has at least 5 listed businesses.
nta_5 = df.groupby("NTA").filter(lambda x: len(x) >= 5)
# Named aggregation: one output column per (source column, function) pair.
nta_5_stats = nta_5.groupby("NTA").agg(
    **{
        "Average Total Savings": pd.NamedAgg(column="Total Savings", aggfunc="mean"),
        "Total Jobs Created": pd.NamedAgg(column="Job created", aggfunc="sum"),
    }
)
nta_5_stats.style.format({'Average Total Savings': '${0:,.2f}', 'Total Jobs Created': '{}'})
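
To see how the filter and the named aggregation interact, here is a small sketch on a toy DataFrame. The NTA labels `A` and `B` and all numbers are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy data: NTA "A" has 5 businesses, NTA "B" has only 2.
toy = pd.DataFrame({
    "NTA": ["A"] * 5 + ["B"] * 2,
    "Total Savings": [100.0, 200.0, 300.0, 400.0, 500.0, 50.0, 70.0],
    "Job created": [1.0, None, 2.0, None, 3.0, 4.0, None],
})

# filter() keeps every row of a group when the predicate is True for that group.
kept = toy.groupby("NTA").filter(lambda g: len(g) >= 5)

# The dict-unpacking form is needed because the output names contain spaces.
stats = kept.groupby("NTA").agg(
    **{
        "Average Total Savings": pd.NamedAgg(column="Total Savings", aggfunc="mean"),
        "Total Jobs Created": pd.NamedAgg(column="Job created", aggfunc="sum"),
    }
)
print(stats)
```

Only NTA `A` survives the filter, and the missing `Job created` values are skipped by `sum` rather than propagated.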

Save your result for the previous question as a CSV file.

In [None]:
filepath = "datasets/nta_stats.csv"
nta_5_stats.to_csv(filepath)

Using the same data set and results, create a scatter plot of jobs created versus average savings. Use both a standard and a logarithmic scale for the average savings.

In [None]:
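# Load the saved statistics under a separate name so the original df is not overwritten.
nta_stats = pd.read_csv("datasets/nta_stats.csv")
jobs_created = nta_stats["Total Jobs Created"]
avg_savings = nta_stats["Average Total Savings"]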

fig1, axs = plt.subplots(2, sharex=True, figsize=[8, 9.6])
lin_scatter=axs[0]
log_scatter=axs[1]

fig1.suptitle("Jobs Created vs. Average Total Savings")
lin_scatter.xaxis.set_major_locator(ticker.MultipleLocator(5))
lin_scatter.xaxis.set_minor_locator(ticker.MultipleLocator(1))

lin_scatter.set_title("Linear Scale")
lin_scatter.scatter(jobs_created, avg_savings)

log_scatter.set_yscale('log')
log_scatter.set_title("Logarithmic Scale")
log_scatter.scatter(jobs_created, avg_savings)
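# `tick` was never defined in the notebook; a dollar-amount formatter is assumed here.
log_scatter.yaxis.set_major_formatter(ticker.StrMethodFormatter("${x:,.0f}"))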
log_scatter.set_xlabel("Reported Jobs Created")

Create a histogram of the log of the average total savings.

In [None]:
hist = plt.subplot()
hist.set_title("Histogram of the Log of the Average Total Savings")
hist.set_ylabel("Number of NTAs")
hist.set_xscale('log')
hist.grid(axis="y")
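# Bin edges that are evenly spaced on a log scale (10 edges, 9 bins).
logbins = np.geomspace(avg_savings.min(), avg_savings.max(), 10)
# Draw the histogram; the original cell set up the axes but never plotted the data.
hist.hist(avg_savings, bins=logbins)
plt.show()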
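
The reason for `np.geomspace` is that its edges are equally spaced in log space, so the bars have equal width on a logarithmic x-axis. A minimal sketch with made-up values (not from the data set):

```python
import numpy as np

# Toy values spanning three orders of magnitude (illustrative only).
values = np.array([1.0, 2.0, 5.0, 20.0, 50.0, 500.0, 1000.0])

# 4 edges give 3 bins; consecutive edges differ by a constant factor (here 10).
edges = np.geomspace(values.min(), values.max(), 4)

# Count how many values fall into each log-spaced bin.
counts, _ = np.histogram(values, bins=edges)
print(edges)
print(counts)
```

With linearly spaced bins almost all of these values would land in the first bin, which is why the log-spaced edges are paired with `set_xscale('log')`.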

Create a line plot of the total jobs created for each month.

In [None]:
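# Assumes jobs_created_timeseries is a Series of "Job created" values with a
# DatetimeIndex; it is not defined in the cells shown here.
jobs_by_month = jobs_created_timeseries.resample('M').sum()
fig = plt.figure(figsize=(10, 4))
plt.plot(jobs_by_month)
plt.title("Jobs Created by Month")  # pyplot has title(), not set_title()
plt.show()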
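
The cell above relies on a datetime-indexed Series that the notebook does not show being built. One way it might be constructed is sketched below on toy data. The column name `date` is hypothetical, and the `"MS"` (month start) alias is used here only because recent pandas versions deprecate `'M'` in favor of `'ME'`.

```python
import pandas as pd

# Toy data; the "date" column name and all values are illustrative assumptions.
toy = pd.DataFrame({
    "date": ["2019-01-05", "2019-01-20", "2019-02-10", "2019-04-01"],
    "Job created": [2, 3, 5, 1],
})

# Parse the dates and use them as the index so that resample() can bin by month.
toy["date"] = pd.to_datetime(toy["date"])
series = toy.set_index("date")["Job created"]

# Sum jobs per calendar month; months with no rows appear with a total of 0.
monthly = series.resample("MS").sum()
print(monthly)
```

Note that March has no rows in the toy data but still shows up in the result, which keeps the line plot continuous across empty months.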