# Visualization

## Data Visualization with `matplotlib`

This section was written by Weijia Wu.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
url = 'https://raw.githubusercontent.com/JoannaWuWeijia/Data_Store_WWJ/main/cleaning_data_rodent3.csv'

df = pd.read_csv(url)

### Introduction
Hi class, my name is Weijia Wu, and I'm a senior double majoring in Applied Math and Statistics. 
This section covers the basic concepts of data visualization in Python.


### Matplotlib
Matplotlib is a desktop plotting package designed for plotting 
and arranging data visually in Python, usually in two dimensions. 
It was created by Dr. John Hunter in 2003 as an alternative to MATLAB to facilitate 
scientific computation and data visualization in Python.

Matplotlib is widely used because of its simplicity and effectiveness.

#### Installation of `Matplotlib`
The library can be installed by running the following command in your terminal:

```
pip install matplotlib
```

#### Line Plot 

##### Single plot with `pyplot` submodule

Let's start with a simple line plot example:

In [None]:
t = range(0, 10) 
r = [i**2 for i in t]

plt.figure(figsize=(4, 4)) 
## Width and height in inches
plt.plot(t, r)
plt.title('Line Plot Example')

plt.show()

##### x-label, y-label, and grid:

In [None]:
plt.figure(figsize=(4, 4)) 

plt.plot(t, r)
plt.title('Line Plot Example2')
plt.xlabel('t value')
plt.ylabel('r value')
plt.grid(True)

##### Add legend:

In [None]:
plt.figure(figsize=(4, 4)) 

## legend() only lists elements that were given a label
plt.plot(t, r, label='r')
plt.title('Line Plot Example3')
plt.xlabel('t value')
plt.ylabel('r value')
plt.grid(True)
plt.legend()

To add a legend to a plot in Matplotlib, you can use the `legend()` function. 

A legend is a small area on the plot that describes each element of the graph. 

To effectively use the legend, you typically need to label the elements 
of the plot that you want to appear in the legend using the label parameter when plotting them. 

In [None]:
plt.plot(t, r, label='r')
plt.legend(loc='lower right', title='Legend Title', fontsize='small')

The `help(plt.legend)` command in Python is used to display the documentation 
for the legend function from the Matplotlib library. This documentation 
includes a description of what the function does, the parameters it accepts, 
and other relevant information such as return values and examples of how to use the function.

In [None]:
help(plt.legend)

##### Colors, Markers, and Line Styles

If we want two lines in the same figure, we need a way to distinguish them.

In [None]:
r2 = [i**3 for i in t]

plt.figure(figsize=(4, 4)) 

plt.plot(t, r, linestyle = '--', color = 'r', marker = 'o', label = 'r')
plt.plot(t, r2, linestyle = '-', color = 'b', marker = 'v', label = 'r2')

plt.title('Line Plot Example4')
plt.xlabel('t value')
plt.ylabel('r value')
plt.grid(True)
plt.legend()

plt.show()

Use the `linestyle`, `color`, and `marker` arguments to control how each line looks. The full list of options is in the `plt.plot` documentation:

In [None]:
## help(plt.plot)
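
As a small sketch of these arguments, the example below draws two made-up series: one styled with keyword arguments and one with Matplotlib's format-string shorthand, where `'b-v'` means blue, solid line, triangle-down markers. The data and the labels `squares` and `cubes` are illustrative only.

```python
import matplotlib.pyplot as plt

t = list(range(0, 10))
squares = [i**2 for i in t]
cubes = [i**3 for i in t]

fig, ax = plt.subplots(figsize=(4, 4))

# Keyword form: each property is spelled out
(line1,) = ax.plot(t, squares, linestyle='--', color='r', marker='o',
                   label='squares')
# Shorthand form: color, line style, and marker packed into one string
(line2,) = ax.plot(t, cubes, 'b-v', label='cubes')

ax.legend()
plt.show()
```

The shorthand is convenient for quick plots, while the keyword form is easier to read when several properties are set at once.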

### Example with rodent data: 

Let's use our rodent data to plot the number of reports per month:

In [None]:
df['Created Date'] = pd.to_datetime(df['Created Date'])

df['Month'] = df['Created Date'].dt.to_period('M')
monthly_counts = df.groupby('Month').size()

plt.figure(figsize=(10, 8))
monthly_counts.plot(kind='line')
plt.title('Monthly Report Count')
plt.xlabel('Month')
plt.ylabel('Number of Reports')
plt.grid(True)
plt.xticks(rotation=45)

plt.show()

This plot shows the number of rodent reports in each month. 
Rodent sightings occur mostly in the spring and summer, 
and they fall sharply after the start of autumn (post-August).
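
To see what the monthly grouping does, here is a minimal sketch on three made-up timestamps. The `toy` DataFrame and its date format are assumptions for illustration, not the rodent data itself.

```python
import pandas as pd

# Made-up timestamps, for illustration only
toy = pd.DataFrame({'Created Date': ['01/05/2023 10:00:00 AM',
                                     '01/20/2023 03:30:00 PM',
                                     '02/02/2023 08:15:00 AM']})

# Parse the strings into datetimes
toy['Created Date'] = pd.to_datetime(toy['Created Date'],
                                     format='%m/%d/%Y %I:%M:%S %p')

# Collapse each timestamp to its calendar month, then count rows per month
toy['Month'] = toy['Created Date'].dt.to_period('M')
monthly = toy.groupby('Month').size()
print(monthly)
```

Each timestamp is reduced to a monthly period, so the two January rows fall into the same group.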


#### Scatter plot

In [None]:
np.random.seed(8465)

x = np.random.uniform(0, 3, 10)
y = np.random.uniform(0, 3, 10)
z = np.random.uniform(0, 3, 10)

plt.scatter(x, y)
plt.xlabel('X')
plt.ylabel('Y')

plt.show()

#### Bar Plot

In [None]:
borough_counts = df['Borough'].value_counts()

plt.figure(figsize=(8, 6))  
plt.bar(borough_counts.index, borough_counts.values, color='green')
plt.xlabel('Borough')  
plt.ylabel('Number of Rodent Sightings')  
plt.title('Rodent Sightings by Borough') 
plt.xticks(rotation=45)  # Rotate the X axis by 45 degrees to show the long labels

plt.show()

#### Multiple plots using the `subplots` function

In [None]:
df['Created Date'] = pd.to_datetime(df['Created Date'])
df['Date'] = df['Created Date'].dt.date
daily_reports = df.groupby(['Date', 'Incident Zip']).size().reset_index(name='Counts')
sample_zip = daily_reports['Incident Zip'].dropna().iloc[0]
sample_data = daily_reports[daily_reports['Incident Zip'] == sample_zip]

## 2x2 Plot
fig, axs = plt.subplots(2, 2, figsize=(8, 8))

## Line Plot
axs[0, 0].plot(sample_data['Date'], sample_data['Counts'], '-o', color='green')
axs[0, 0].set_title(f'Line Plot of Reports for Zip {sample_zip}')
axs[0, 0].tick_params(labelrotation=45)

## Box Plot
axs[0, 1].boxplot(df['Y Coordinate (State Plane)'].dropna())
axs[0, 1].set_title('Boxplot of Y Coordinate')

## barplot
status_counts = df['Status'].value_counts()
axs[1, 0].bar(status_counts.index, status_counts.values, color='skyblue')
axs[1, 0].set_title('Barplot of Status Counts')
axs[1, 0].tick_params(labelrotation=45)

## histogram
axs[1, 1].hist(df['Latitude'].dropna(), bins=30, color='orange')
axs[1, 1].set_title('Histogram of Latitude')

plt.tight_layout()
plt.show()

#### Save the files

`plt.savefig` saves the current figure created by
 Matplotlib to a file. You can specify the filename and various 
 options to control the format, quality, and layout of the output file. 
 Run `help(plt.savefig)` to see all of them.


In [None]:
## help(plt.savefig)
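
A minimal sketch of saving a figure is shown below. The file name `line_plot.png` and the temporary output folder are hypothetical choices made so that nothing on disk is overwritten.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 4))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax.set_title('Saved Figure Example')

# Hypothetical output location: a fresh temporary folder
out_path = os.path.join(tempfile.mkdtemp(), 'line_plot.png')

# dpi sets the resolution; bbox_inches='tight' trims extra white space
fig.savefig(out_path, dpi=150, bbox_inches='tight')
print(out_path)
```

The file format is inferred from the extension, so changing the name to end in `.pdf` or `.svg` would write that format instead. It is usually safest to save before calling `plt.show()`, since some backends clear the figure once the window is closed.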

### Pandas

Pandas plotting is built on top of Matplotlib, and one of its main 
benefits is that it allows you to generate plots with fewer lines of 
code directly from Pandas data structures like DataFrames and Series. 
This integration simplifies the process of visualizing data for analysis.


#### Line Plot

##### Single plot

In [None]:
monthly_counts.plot(kind='line')

Because a line plot is the default in pandas, you can omit `kind='line'`.

The `.plot()` method in Pandas generates basic plots with fewer lines of code 
because Pandas automatically handles some of the basic settings, such as the 
x-axis labels. For more detailed customization, such as adding gridlines or 
rotating x-axis labels, you still need additional Matplotlib commands.


In [None]:
plt.figure(figsize=(8, 6))
monthly_counts.plot(kind='line')

plt.title('Monthly Report Count')
plt.xlabel('Month')
plt.ylabel('Number of Reports')
plt.grid(True)
plt.xticks(rotation=45)
## Rotated so that longer tick labels do not overlap

plt.show()
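
The same customization can also be written against the object that `.plot()` returns, which is a Matplotlib `Axes`. The sketch below uses made-up monthly counts (the `counts` Series is an assumption for illustration, not the rodent data).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up monthly counts, for illustration only
counts = pd.Series([12, 18, 25, 31, 22, 15],
                   index=pd.period_range('2023-01', periods=6, freq='M'))

# .plot() returns the Matplotlib Axes it drew on
ax = counts.plot(figsize=(8, 6))

# Customize through the Axes instead of the plt.* functions
ax.set_title('Monthly Report Count')
ax.set_xlabel('Month')
ax.set_ylabel('Number of Reports')
ax.grid(True)
ax.tick_params(axis='x', labelrotation=45)

plt.show()
```

Working with the returned `Axes` keeps every setting tied to one specific plot, which helps once a figure holds several subplots.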

##### Multi-Lineplot
The following example draws several lines in the same figure. 

In [None]:
community_counts = df['Community Districts'].value_counts().sort_index()
city_council_counts = df['City Council Districts'].value_counts().sort_index()
police_precincts_counts = df['Police Precincts'].value_counts().sort_index()

counts_df = pd.DataFrame({
    'Community Districts': community_counts,
    'City Council Districts': city_council_counts,
    'Police Precincts': police_precincts_counts
})
counts_df = counts_df.fillna(0)
## Fill missing values with 0

counts_df[['Community Districts', 'City Council Districts', 
'Police Precincts']].plot() 

When you use the .plot() method on a Pandas DataFrame to create a multi-line plot,
 each line in the plot is automatically assigned a different color to 
 help distinguish between the different data columns visually. 
 The colors are chosen from a default color cycle provided by Matplotlib. 

If you want to customize the color: 

In [None]:
counts_df[['Community Districts', 'City Council Districts', 'Police Precincts']].plot(
    color=['red', 'green', 'blue']  # Custom colors for each line
)

#### Additional arguments

For more information, please check the [additional arguments](https://drive.google.com/file/d/1j5T7_VMT1Nt4myukcmar0UMcZOHqurCk/view?usp=sharing) reference or the built-in help:

In [None]:
## help(plt.plot)

#### Bar Plot

For categorical data, one of the most common visualizations is the bar plot.

+ Generate one with the `df.plot.bar()` method;
for a horizontal version, use `df.plot.barh()`.

##### Side-by-side Bar Plot:

Let's use Borough and Location Type to generate a side-by-side bar plot, one horizontal and one vertical:

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(8, 6))

## Vertical bar plot for Borough counts
df.groupby(['Borough']).size().plot.bar(ax=axs[0], color='skyblue', rot=0)
axs[0].set_title('Bar plot for Borough')

## Horizontal bar plot for Location Type counts
df.groupby(['Location Type']).size().plot.barh(ax=axs[1], color='lightgreen')
axs[1].set_title('Bar plot for Location Type')


plt.tight_layout()
plt.show()

This works the same way as `axs` did with Matplotlib's `plt.subplots` earlier:

+ `nrows=1` means there will be 1 row of subplots.
+ `ncols=2` means there will be 2 columns of subplots.

##### Grouped Bar Plot
This type of plot is useful for comparing the distribution within each class side by side.

In [None]:
class_Borough = pd.crosstab(df["Borough"], df["Status"])

class_Borough.plot.bar(rot=45, figsize=(8, 6))

##### Stacked Bar Plot
This plot is useful for comparing the total counts across 
boroughs while still showing the share of each status within each borough.

In [None]:
class_Borough.plot.bar(stacked=True)
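
When the boroughs have very different totals, the shares are easier to compare if every bar is scaled to the same height. The sketch below does this with `normalize='index'` in `pd.crosstab`. The `toy` records are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up records, for illustration only
toy = pd.DataFrame({
    'Borough': ['BROOKLYN', 'BROOKLYN', 'BROOKLYN',
                'QUEENS', 'QUEENS', 'BRONX'],
    'Status': ['Closed', 'Closed', 'Open', 'Closed', 'Open', 'Closed'],
})

# normalize='index' turns each row (borough) into proportions that sum to 1
shares = pd.crosstab(toy['Borough'], toy['Status'], normalize='index')

ax = shares.plot.bar(stacked=True, rot=45, figsize=(8, 6))
ax.set_ylabel('Share of Reports')
plt.show()
```

The trade-off is that the totals are no longer visible, so this version complements the count-based stacked plot rather than replacing it.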

#### Histogram and Density Plots

For numeric data, a histogram allows us to see the distribution (center, shape, skewness) of the data.

A histogram can be generated using the `df.plot.hist()` method.

Since we have limited numeric data in our rodent data, 
I used student achievement data to present it:

In [None]:
url2 = 'https://raw.githubusercontent.com/JoannaWuWeijia/Data_Store_WWJ/main/grades_example.csv'
df2 = pd.read_csv(url2)

In [None]:
df2["Grade"].plot.hist(bins = 10, figsize=(8, 6))

As can be seen from the plot, the students' scores appear roughly normally distributed, 
with most of them clustered in the 70-80 range.

In [None]:
df2["Grade"].plot.density()

#### Scatter Plots
When dealing with two variables, a scatter plot allows us to 
examine whether there is any correlation between them.

A scatter plot can be generated using the `df.plot.scatter(x = col1, y = col2)` method.

In [None]:
url3 = 'https://raw.githubusercontent.com/JoannaWuWeijia/Data_Store_WWJ/main/student_example3.csv'
df3 = pd.read_csv(url3)

In [None]:
df3.plot.scatter(x="Weight", y="Height", figsize=(8, 6))

As you can see, the relationship is roughly linear, 
and I'll cover how to add a regression line in the Seaborn section below.

### Seaborn
+ Seaborn is designed to work directly with pandas DataFrames, 
making plotting more convenient by allowing direct use of DataFrame 
columns for specifying data in plots.

+ Seaborn makes it easy to add linear regression lines and other 
statistical models to your charts, simplifying the process of statistical data visualization.

+ Seaborn's default styles and color palettes are often considered more 
aesthetically pleasing and modern than Matplotlib's.

#### Installation of `Seaborn`

```
pip install seaborn
```

#### Histogram and Density Plots

In [None]:
## help(sns.histplot) 

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(df2['Grade'], bins=10, kde = True)

+ `bins`: The number of bars in the histogram. 
More bins show the distribution in more detail, 
but too many can make the chart hard to read; 
too few may hide the shape of the distribution.

+ `kde`: When set to `True`, a kernel density estimate (KDE) curve is 
added to the histogram, which can help show the shape of the data distribution.
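
The effect of `bins` can be checked without drawing anything. The sketch below uses `np.histogram`, which computes the same kind of bin counts a histogram displays, on simulated grades. The data and the choice of 5 and 40 bins are assumptions for illustration.

```python
import numpy as np

# Simulated grades, for illustration only
rng = np.random.default_rng(0)
grades = rng.normal(loc=75, scale=8, size=200)

# Few bins: a coarse summary with many observations per bar
counts_5, edges_5 = np.histogram(grades, bins=5)

# Many bins: finer detail, but only a handful of observations per bar
counts_40, edges_40 = np.histogram(grades, bins=40)

print(counts_5)
print(counts_40)
```

Both calls account for all 200 observations. Only the width of the bins changes, which is why a large `bins` value tends to produce a noisier-looking histogram.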

#### Scatter plot with Regression line

I used a smaller example here so that the regression line is easy to see. 
We can see that the height and weight of the students are positively correlated.

In [None]:
df4 = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 
    'Eva', 'Fiona', 'George', 'Hannah', 'Ian', 'Julia'],
    'Height': [160, 172, 158, 165, 170, 162, 175, 168, 180, 155],
    'Weight': [55, 72, 60, 68, 62, 56, 80, 65, 75, 50]})

plt.figure(figsize = (8, 6))
sns.regplot(x='Weight', y='Height', data=df4)

#### Categorical Data

##### barplot

In [None]:
np.random.seed(0) 
genders = np.random.choice(['Male', 'Female'], size=500)
classes = np.random.choice(['A', 'B', 'C', 'D'], size=500)
grades = np.random.choice(['Excellent', 'Good', 'Average', 'Poor'], size=500)
df4 = pd.DataFrame({'Gender': genders, 'Class': classes, 'Grades': grades})

In [None]:
sns.catplot(x='Class', hue='Gender', col='Grades', 
kind='count', data=df4, height=5, col_wrap=2)
plt.show()

+ `x='Class'`: This sets the x-axis to represent different classes, 
so each class will have its own set of bars in the plot.

+ `hue='Gender'`: This parameter adds a color coding (hue) based on the 'Gender' column

+ `col='Grades'`: This creates separate subplots (columns) for 
each unique value in the 'Grades' column (e.g., Excellent, Good, Average, Poor), 
effectively grouping the data by grades.

+ `col_wrap=2`: Limits the number of these subplots to 2 per row. 
If there are more than 2 unique grades, additional rows 
will be created to accommodate all the subplots.

+ `kind='count'`: Specifies the kind of plot to draw. In this case, 
`'count' `means it will count the occurrences of each category 
combination and display this as bars in the plot.

+ `height=5`: Sets the height of each subplot to 5 inches.

##### Box Plot

In [None]:
sns.boxplot(x='Gender', y='Grades', hue='Class', data=df4)
plt.show()

+ `x='Gender'`: the x-axis variable.

+ `y='Grades'`: the y-axis variable, which in this case is 'Grades', 
a categorical variable with values like 'Excellent', 'Good', 'Average', and 'Poor'. 
Box plots summarize numeric data, so a numeric column is normally used for `y`.

+ `hue='Class'`: draws a separate, color-coded box for each unique value 
in the 'Class' column within each gender, effectively grouping the data by class.
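
Here is a minimal sketch of the same call with a numeric y variable. The `Score` column is simulated and hypothetical. It is not part of the data used above.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200

# Simulated data, for illustration only
scores = pd.DataFrame({
    'Gender': rng.choice(['Male', 'Female'], size=n),
    'Class': rng.choice(['A', 'B'], size=n),
    'Score': rng.normal(loc=75, scale=10, size=n),  # hypothetical numeric grade
})

# One box per gender and class, colored by class
ax = sns.boxplot(x='Gender', y='Score', hue='Class', data=scores)
ax.set_title('Score by Gender and Class')
plt.show()
```

With a numeric `y`, each box shows the median and quartiles of the scores for one gender and class combination.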


##### Categorical Data Help

In [None]:
##help(sns.catplot)

### Conclusion

`Matplotlib` is the foundation for making plots in Python.

`pandas` uses Matplotlib for its plotting features but is mainly for handling data.

`Seaborn` makes Matplotlib prettier and easier to use, especially with pandas data.




### References
+ https://matplotlib.org/stable/users/project/history.html
+ https://matplotlib.org/stable/gallery/lines_bars_and_markers/simple_plot.html
+ https://www.simplilearn.com/tutorials/python-tutorial/matplotlib
+ https://www.w3schools.com/python/pandas/pandas_plotting.asp
+ https://github.com/mwaskom/seaborn/tree/master/seaborn
+ https://seaborn.pydata.org/installing.html
+ https://ritza.co/articles/matplotlib-vs-seaborn-vs-plotly-vs-MATLAB-vs-ggplot2-vs-pandas/

## Grammar of Graphics with `Plotnine`

This section was written by Olivia Massad.


### Introduction

Hello everyone! My name is Olivia Massad and I am a junior Statistical
Data Science Major. I am very interested in sports statistics and analytics,
especially involving football, and am very excited to learn more about coding 
and data science in this class. Today I will be talking about grammar of 
graphics for python, using `Plotnine`. This is a new topic for me so I am very
excited to show you all what we can do with it.


### What is Grammar of Graphics?

Similarly to how languages have grammar in order to structure language and create
a standard for how sentences and words should be arranged, grammar of graphics 
provides the framework for a consistent way to structure and create statistical 
visualizations. This framework helps us to create graphs and visualizations which 
can be widely understood due to the consistent structure. The major components 
of grammar of graphics are:

- Data: our dataset and the components of it that we want to visualize.

- Aesthetics: axes, position of data points, color, shape, size.

- Scale: how data values are mapped to the plot, such as axis ranges or 
color scales, chosen to suit the values and ranges in the data.

- Geometric objects: how data points are depicted, whether they're points,
lines, bars, etc.

- Statistics: statistical measures of the data included in the graphic, 
including mean, spread, confidence intervals, etc.

- Facets: subplots for specific data dimensions.

- Coordinate system: cartesian or polar.


### What can you do with `Plotnine`?

`Plotnine` is a Python package that implements the grammar of graphics to 
create data visualizations and graphs. It is based on `ggplot2`
and allows for many variations within graphs. Some examples of things we can
create with `plotnine` are:

- Bar Charts
- Histograms
- Box Plots
- Scatter Plots
- Line Charts
- Time Series
- Density Plots
- etc.


### Using `Plotnine`

In order to use `plotnine` we first need to install the package using 
our command line.

With `conda`:
`conda install -c conda-forge plotnine`

With `pip`:
`pip install plotnine`, or `pip install 'plotnine[all]'` to also install the optional extras.

Now that `plotnine` is installed, we must import it in Python.

In [None]:
from plotnine import *
from plotnine.data import *

Now that `plotnine` is installed and imported, we can begin to make
graphs and plots. Below are different examples of visualizations we
can make using `plotnine` and the personalizations we can add to them. 
For these graphics I used the rodent sighting data from the NYC Open Data
311 requests. We will also need pandas and numpy for some of these graphs,
so we need to import those as well. Additionally, because the data set is
so large, we will only be looking at the first 500 complaints.

In [None]:
from plotnine import *
from plotnine.data import *
import pandas as pd 
import numpy as np 
import os
folder = 'data'
file = 'rodent_2022-2023.feather'
path = os.path.join(folder, file)
data = pd.read_feather(path)
data_used = data.head(500).copy()  # copy so later column edits do not act on a slice

#### Bar Chart

One common type of visualization we can create with `plotnine` is a 
bar chart. For this graph we will look at the data for the descriptors
of each complaint.

In [None]:
(ggplot(data_used, aes(x = 'descriptor')) 
    + geom_bar())

While this code provides us with a nice simple chart, because we are using
`plotnine`, we can make some major improvements to the visualization to
make it easier to read and more appealing. Some simple things we can do are:

- Add a title.
- Color code the bars. 
- Change the orientation of the graph.
- Add titles to the axes.

In [None]:
(ggplot(data_used, aes(x = 'descriptor', fill = 'descriptor')) 
        # Color code the bars.
    + geom_bar() # Bar Chart
    + ggtitle('Descriptor Counts') # Add a title.
    + coord_flip() # Change the orientation of the graph.
    + xlab("Descriptor") # Add title to x axis.
    + ylab("Number of Complaints") # Add titles to y axis.
)

Some more complex changes we can make to our graph are:

- Change the orientation of the words on the axes to make them easier to read.
- Add color coded descriptors to each bar.

In [None]:
(ggplot(data_used, aes(x = 'descriptor', fill = 'borough')) 
        # Add color coded descriptors.
    + geom_bar() # Bar Chart
    + ggtitle('Descriptor Counts') # Add a title.
    + xlab("Descriptor") # Add title to x axis.
    + ylab("Number of Complaints") # Add titles to y axis.
    + theme(axis_text_x=element_text(angle=45))
     # Change the orientation of the words.
)

#### Scatter Plot

Another common visualization we can create is a scatterplot. When looking 
at the data from the 311 requests, we can see that there are many data 
points for locations of these complaints. A scatter plot would be a great 
way to see the location of the complaints by graphing the longitudes and 
latitudes. In order to better see the points, for this 
graph we will only use the last 200 complaints.

In [None]:
data_scatter = data.tail(200)
(ggplot(data_scatter, aes(x = 'longitude', y = 'latitude')) 
    + geom_point())

Similar to the original code for the bar chart, this code provides a
very simple scatter plot. `Plotnine` allows us to add many customizations 
to the scatter plot in order to differentiate the points from each other. 
We can:

- Add color to the points.
- Differentiate using point size.
- Differentiate using point shape.

In [None]:
(ggplot(data_scatter, aes(x = 'longitude', y = 'latitude',
       color = 'location_type')) # Add color to the points.
    + geom_point())

In [None]:
(ggplot(data_scatter, aes(x = 'longitude', y = 'latitude',
    size = 'descriptor', # Differentiate using point size.
    shape = 'borough')) # Differentiate using point shape.
    + geom_point())

We can see that because the data points are so close together, differentiating 
them by size and shape can become a little congested. One thing we can do to fix
this while still viewing the same data is to use `facet_grid`.

In [None]:
(ggplot(data_scatter, aes(x = 'longitude', y = 'latitude',
    shape = 'borough')) # Differentiate using point shape.
    + geom_point()
    + facet_grid('descriptor ~ .') # Create multiple plots.
)

In [None]:
(ggplot(data_scatter, aes(x = 'longitude', y = 'latitude'))
    + geom_point()
    + facet_grid('descriptor ~ borough') 
        # Create multiple plots with 2 conditions.
    + theme(strip_text_y = element_text(angle = 0), # change facet text angle
        axis_text_x=element_text(angle=45)) # change x axis text angle
)

#### Histogram

Another common graph we will cover using `plotnine` is a histogram.
Here we will use the created date data as a continuous variable. Using 
`plotnine` we are able to make many of the same personalizations we 
were able to do with bar charts.

In [None]:
data_used['created_date']=pd.to_datetime(
  data_used['created_date'],
  format = "%m/%d/%Y %I:%M:%S %p", errors='coerce')
(ggplot(data_used, aes(x='created_date'))
    + geom_histogram())

Now that we have a simple histogram with our data we can add customizations,
including:

- Change width of bins.
- Change orientation of graph.
- Add color coded descriptors.
- Change outline color.
- Change the orientation of the words on the axes to make them easier to read.

In [None]:
(ggplot(data_used, aes(x='created_date', fill = 'borough')) 
        # Add color coded descriptors.
    + geom_histogram(binwidth=1,  # Change width of bins
      color = 'black') # Change outline color.
    + theme(axis_text_x=element_text(angle=45)) 
        # Change the orientation of the words.
)

In [None]:
(ggplot(data_used, aes(x='created_date', fill = 'borough')) 
        # Add color coded descriptors.
    + geom_histogram(binwidth=1,  # Change width of bins
      color = 'black') # Change outline color.
    + coord_flip() # Change orientation of graph.
)

While we're able to color code the histogram to show other descriptors 
of the data, another way we can do this with `plotnine` is through the use
of multiple graphs. Using `facet_wrap` we can create a multi-facet graph with 
the same data.

In [None]:
(ggplot(data_used, aes(x='created_date')) 
    + geom_histogram(binwidth=1) # Change width of bins
    + facet_wrap('borough') # Create multiple graphs.
    + theme(axis_text_x=element_text(angle=45)) 
    # Change the orientation of the words.
)

#### Density Plot

The last visualization we're going to look at is density plots. While less 
common than the graphs previously discussed, density plots show the 
distribution of a specific variable.

In [None]:
(ggplot(data_used, aes(x='created_date'))
    + geom_density())

Above we can see a very simple density graph with very little description. Using
`plotnine` we are able to:

- Add color coded descriptors.
- Scale groups by relative size.
- Change the orientation of the words on the axes to make them easier to read.

In [None]:
(ggplot(data_used, aes(x='created_date', color = 'descriptor')) 
        #Add color coded descriptors.
    + geom_density()
    + theme(axis_text_x=element_text(angle=45)) 
        # Change the orientation of the words.
)

In [None]:
(ggplot(data_used, aes(x='created_date', color = 'descriptor')) 
        #Add color coded descriptors.
    + geom_density(aes(y=after_stat('count'))) 
        # Scale groups by relative size.
    + theme(axis_text_x=element_text(angle=45)) 
        # Change the orientation of the words.
)

### Resources

- <https://plotnine.readthedocs.io/en/v0.12.4/gallery.html>

### References

- “Plotnine.Geoms.Geom_bar.” Plotnine.Geoms.Geom_bar - Plotnine Commit: 
D1f7dbf Documentation, 
plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_bar.html. 
Accessed 13 Feb. 2024. 

- “Plotnine.Geoms.Geom_density.” Plotnine.Geoms.Geom_density - 
Plotnine Commit: D1f7dbf Documentation, 
plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_density.html. 
Accessed 17 Feb. 2024. 

- “Plotnine.Geoms.Geom_histogram.” Plotnine.Geoms.Geom_histogram - 
Plotnine Commit: D1f7dbf Documentation, 
plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_histogram.html#plotnine.geoms.geom_histogram. 
Accessed 17 Feb. 2024. 

- “Plotnine.Geoms.Geom_point.” Plotnine.Geoms.Geom_point - 
Plotnine Commit: D1f7dbf Documentation, 
plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_point.html. 
Accessed 16 Feb. 2024. 

- “Plotnine.” PyPI, pypi.org/project/plotnine/. Accessed 13 Feb. 2024. 

- Sarkar, Dipanjan (DJ). “A Comprehensive Guide to the Grammar of Graphics 
for Effective Visualization of Multi-Dimensional...” Medium, Towards Data 
Science, 13 Sept. 2018, 
towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149. 

## Handling Spatial Data with `GeoPandas`

This section was written by Pratham Patel.

### Introduction and Installation

Hello! My name is Pratham Patel and I am a senior due to graduate this 
semester with a Bachelor of Science degree in Mathematics/Statistics 
with a Computer Science minor. I hope to gain skills in using various 
different packages of Python in this course, as well as understand 
even more about the Data Science field. An example of learning new 
Python packages is the topic I will present today on the `geopandas` 
package. GeoPandas is an extension of the `pandas` package to support 
geographic data in its dataframes.

The GeoPandas package can be installed via the terminal using any of the following commands.

The documentation recommends:
`conda install -c conda-forge geopandas`

Standard conda install:
`conda install geopandas`

Using pip:
`pip install geopandas`

### Base Concepts

GeoPandas revolves around the `GeoDataFrame` object, which is essentially the 
pandas `DataFrame` object, with all the traditional capabilities in addition to 
the ability to store and operate on geometry columns.

The geometry types include points, lines and closed polygons (the first and last 
coordinates in the list must be the same).

The objects made by `shapely.geometry` can represent these geometry types:

In [None]:
from shapely.geometry import Point, LineString, Polygon
import geopandas as gpd

point = Point(0, 1)
gdf1 = gpd.GeoDataFrame(geometry=[point])

line = LineString([(0, 0), (1, 1)])
gdf2 = gpd.GeoDataFrame(geometry=[line])

#note: the first and last element of 
#the list of tupled points are the same
polygon = Polygon([(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)])
gdf3 = gpd.GeoDataFrame(geometry=[polygon])

In [None]:
gdf1

Some of the basic attributes of a GeoSeries include:
* `length`: returns length of a line

In [None]:
gdf2.length

* `area`: returns the area of the polygon

In [None]:
gdf3.area

* `bounds`: gives the bounds of each row in a column of geometry

* `total_bounds`: gives the bounds of a geometry series

* `geom_type`: returns geometry type

In [None]:
gdf1.geom_type

* `is_valid`: returns True for valid geometries and False otherwise (mostly important for polygons).

In [None]:
gdf3.is_valid
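
The `bounds` and `total_bounds` attributes listed above can be sketched on a small GeoSeries. The three shapes below are made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point, LineString, Polygon

# Three made-up geometries, for illustration only
shapes = gpd.GeoSeries([
    Point(3, 1),
    LineString([(0, 0), (1, 1)]),
    Polygon([(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)]),
])

# One row per geometry, with columns minx, miny, maxx, maxy
print(shapes.bounds)

# A single array (minx, miny, maxx, maxy) covering the whole series
print(shapes.total_bounds)
```

`bounds` is useful for inspecting individual geometries, while `total_bounds` gives the extent needed to frame all of them in one map.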

Next, we will cover various methods to be used on GeoSeries objects:

* `distance()`: returns the Series with the minimum distance from each entry to another geometry or 
Series (argument `other`).
    + Note: a secondary argument `align` is a boolean to align the GeoSeries by index if set to True

In [None]:
gdf2.distance(Point((1,0)))
gdf2.distance(LineString([(0, 2), (1, 2)]))

* `centroid`: returns a new GeoSeries with the center of each row's geometry.

In [None]:
gdf3.centroid

* `contains()`: returns True if the shape contains a specific geometry or Series.
    + parameters `other` and `align`

In [None]:
gdf3.contains(Point((0.5, 1.5)))

In [None]:
gdf3.contains(gdf1)

* `intersects()`: returns True if the shape intersects another geometry or Series.
    + parameters `other` and `align`
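
A minimal sketch of `intersects()` is shown below, using a square like the one defined earlier. The names `crossing_line` and `far_point` are made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point, LineString, Polygon

square = gpd.GeoSeries([Polygon([(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)])])

crossing_line = LineString([(1, 1), (3, 3)])  # starts inside the square
far_point = Point(5, 5)                       # well outside the square

# Each call returns a boolean Series with one entry per geometry
hits = square.intersects(crossing_line)
misses = square.intersects(far_point)

print(hits)
print(misses)
```

Unlike `contains()`, `intersects()` only requires the two geometries to share at least one point, so a line that merely crosses the boundary still counts.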

### Reading Files into `GeoDataFrame`s

The function `geopandas.read_file()` is the best way to read a file 
with both data and geometry into a `GeoDataFrame` object. From 
here on, we will use the NYC rodent data and visualize it. The 
code below converts every incident's location into a point 
in the geometry column.

In [None]:
# Reading csv file 
import pandas as pd 
import numpy as np
# Shapely for converting latitude/longitude to a point
from shapely.geometry import Point 
# To create GeoDataFrame
import geopandas as gpd 

# read in the feather file as a generic pandas DataFrame
rat_22_23 = pd.read_feather('data/rodent_2022-2023.feather')

# keep only the rows that have both coordinates
has_coords = rat_22_23["longitude"].notna() & rat_22_23["latitude"].notna()
rat_located = rat_22_23.loc[has_coords]

# create the geometry column from the built-in longitude and latitude
# coordinates using the points_from_xy method
geometry = gpd.points_from_xy(rat_located["longitude"], rat_located["latitude"])

# coordinate reference system (EPSG:4326 implies geographic coordinates);
# the older form {'init': 'epsg:4326'} is deprecated
crs = "EPSG:4326"

# create GeoDataFrame (rows with missing coordinates were removed above)
rodent_gdf = gpd.GeoDataFrame(rat_located, crs=crs, geometry=geometry)

Here, we can take a look at the new GeoDataFrame:

In [None]:
rodent_gdf.head()

### Plotting
The new geometry allows us to plot the data easily.

In [None]:
#standard plot of every single rodent incident
rodent_gdf.plot()

#color the plot by borough
rodent_gdf.plot(column = 'borough', legend=True)

#color the plot by borough, with more transparent markers
rodent_gdf.plot(column = 'borough', alpha = 0.01)

#color by the descriptor of the incident
rodent_gdf.plot(column = 'descriptor', legend=True)

#Plot the missing information for borough
rodent_gdf.plot(column='borough', missing_kwds={'color':'red'})

#color the plot by zipcode
rodent_gdf.plot(column = 'incident_zip', legend=True)

Note that if a numeric column is passed, the legend will present the key as a color gradient by default.


You can individualize each zipcode using categorical=True, though be sure the list 
of unique integers is not too large.

`rodent_gdf.plot(column = 'incident_zip', legend=True, categorical=True)`

The geographic visualizations allow us to look for trends among the reported rodent incidents.

### Interactive Maps
A very interesting aspect is the ability to create interactive maps using the `.explore()` method.

Note that `folium`, `matplotlib`, and `mapclassify` are necessary for the `.explore()` function.

In [None]:
#interactive map with incidents colored by borough
rodent_gdf.explore(column='borough', legend=True)

This map lets us find specific points and examine them and their surroundings.


### Setting and Changing Projections
In the code above, a Coordinate Reference System (CRS) of EPSG:4326 was set when the `GeoDataFrame` was created. 
A CRS can also be set on an already initialized GeoDataFrame using the `.set_crs()` method. We can do this 
for our previous example `gdf3`:

In [None]:
gdf3 = gdf3.set_crs("EPSG:4326")
gdf3.plot()

A GeoDataFrame that already has a CRS can be converted to another one with the `.to_crs()` method, which reprojects the coordinates. 
Examples include:
* EPSG:2263 - NAD83 / New York Long Island, with coordinates in feet
* EPSG:3395 - World Mercator
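
A minimal sketch of a reprojection is shown below, assuming a single made-up point near New York City. The point and the variable names are illustrative only.

```python
import geopandas as gpd
from shapely.geometry import Point

# A single illustrative point near New York City (longitude, latitude)
nyc = gpd.GeoDataFrame(geometry=[Point(-73.98, 40.75)], crs="EPSG:4326")

# to_crs() transforms the coordinates into the new system;
# set_crs() only labels coordinates that are already in that system
nyc_feet = nyc.to_crs("EPSG:2263")

print(nyc.crs)
print(nyc_feet.crs)
print(nyc_feet.geometry.iloc[0])
```

After the conversion the point is expressed in feet rather than degrees, which makes distance and area calculations more meaningful for local analysis.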


### References

* GeoPandas Documentation: 
    + https://geopandas.org/en/stable/index.html

* NYC 311 Service Request
    + https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/about_data
