# Data Exploration with Python and Jupyter

Basic usage of the Pandas library to download a dataset,
explore its contents, clean up missing or invalid data,
filter the data according to different criteria,
and plot visualizations of the data.

- [Part 1: Python and Jupyter](https://ssciwr.github.io/jupyter-data-exploration)
- [Part 2: Pandas with toy data](https://ssciwr.github.io/jupyter-data-exploration/pandas-toy-data.slides.html)
- **Part 3: Pandas with real data**

*Press `Spacebar` to go to the next slide (or `?` to see all navigation shortcuts)*

# Let's download some real data

For some reason, the London Fire Brigade provides a public spreadsheet of all animal rescue incidents since 2009:

https://data.london.gov.uk/dataset/animal-rescue-incidents-attended-by-lfb

They provide a link to the dataset in CSV (comma-delimited) format.

In [None]:
# import the Pandas library & matplotlib for plotting

import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# download a csv file with some data and convert it to a DataFrame
url = "https://data.london.gov.uk/download/animal-rescue-incidents-attended-by-lfb/8a7d91c2-9aec-4bde-937a-3998f4717cd8/Animal%20Rescue%20incidents%20attended%20by%20LFB%20from%20Jan%202009.csv"
df = pd.read_csv(url, encoding="unicode_escape")

# Sidenote

You may be wondering what the `encoding="unicode_escape"` argument is for, and how I knew it needed to be there.

Well, I first tried `pd.read_csv(url)` and got this:

```
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/tmp/ipykernel_514329/874975015.py in <module>
----> 1 pd.read_csv(url)

~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    608     kwds.update(kwds_defaults)
    609 
--> 610     return _read(filepath_or_buffer, kwds)
    611 
    612 

~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    460 
    461     # Create the parser.
--> 462     parser = TextFileReader(filepath_or_buffer, **kwds)
    463 
    464     if chunksize or iterator:

~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    817             self.options["has_index_names"] = kwds["has_index_names"]
    818 
--> 819         self._engine = self._make_engine(self.engine)
    820 
    821     def close(self):

~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1048             )
   1049         # error: Too many arguments for "ParserBase"
-> 1050         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1051 
   1052     def _failover_to_python(self):

~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1896 
   1897         try:
-> 1898             self._reader = parsers.TextReader(self.handles.handle, **kwds)
   1899         except Exception:
   1900             self.handles.close()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._get_header()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 105: invalid start byte
```
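
The last line of the traceback is the useful part: byte `0xa3` cannot start a UTF-8 sequence, which suggests the file uses a single-byte encoding instead. Here is a minimal sketch of how one might check this with the standard library alone. The sample bytes are made up for illustration (a column name containing a pound sign, assumed to be Latin-1 encoded) and are not read from the dataset.

```python
# hypothetical sample: a column name with a pound sign, encoded as Latin-1
sample = "HourlyNotionalCost(£)".encode("latin-1")

# in Latin-1 the pound sign is the single byte 0xa3
print(sample[-2:])  # b'\xa3)'

# decoding as UTF-8 fails for the same reason read_csv did
try:
    sample.decode("utf-8")
    utf8_reason = None
except UnicodeDecodeError as error:
    utf8_reason = error.reason
print(utf8_reason)  # invalid start byte

# these three encodings all map byte 0xa3 to the pound sign
decoded = {
    encoding: sample.decode(encoding)
    for encoding in ("latin-1", "cp1252", "unicode_escape")
}
print(decoded)
```

Note that `unicode_escape` also interprets backslash sequences such as `\n` inside the data, so `encoding="latin-1"` or `encoding="cp1252"` may be a more conventional choice if the file really is in one of those encodings.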

## Suggested workflow / philosophy

- you want to do something
  - if you know / have a guess which function to use, look at its docstring: `?function_name`
  - if you don't have any idea what to try, google `how do I ... in pandas`
  - if in doubt, just try something!
- if you get an error, copy & paste the last bit into google (along with `function_name` and/or `pandas`)
  - don't be intimidated by the long and apparently nonsensical error messages
  - almost certainly someone else has had this exact problem
  - almost certainly the solution is waiting for you
- look for a stackoverflow answer with many up-votes
  - ignore the green tick: it only means the person asking the question accepted the answer
  - typically an answer with many up-votes is a better option
  - more recent answers can also be better: sometimes a library has changed since an older answer was written

(For anyone who wasn't already doing this, that may be the most useful thing in this course)

# Display the DataFrame

In [None]:
df

# Column data types

In [None]:
df.dtypes

# Convert DateTimeOfCall to a date-time

In [None]:
df["DateTimeOfCall"].head()

In [None]:
# this looks like what we want..
pd.to_datetime(df["DateTimeOfCall"]).head()

In [None]:
# ..but which number is the month and which is the day?
# how can we check if what we just did makes sense?
pd.to_datetime(df["DateTimeOfCall"]).plot()
# should be a single monotonically increasing line: looks like days/months are mixed up

In [None]:
# check the docs to see what options are available:
?pd.to_datetime

In [None]:
# this looks better
pd.to_datetime(df["DateTimeOfCall"], dayfirst=True).head()

In [None]:
# consistency check: plot datetime vs index, should increase monotonically
pd.to_datetime(df["DateTimeOfCall"], dayfirst=True).plot()

In [None]:
# replace DateTimeOfCall column in dataframe with this one
df["DateTimeOfCall"] = pd.to_datetime(df["DateTimeOfCall"], dayfirst=True)
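
The day/month ambiguity can also be checked without a plot. The sketch below uses made-up date strings (not taken from the dataset) that are meant to be read day-first, and compares the default parsing with `dayfirst=True` and with an explicit `format`.

```python
import pandas as pd

# hypothetical day-first dates: 1 Feb, 2 Feb and 1 Mar 2021
raw = pd.Series(["01/02/2021", "02/02/2021", "01/03/2021"])

# default parsing reads these month-first: 2 Jan, 2 Feb, 3 Jan
month_first = pd.to_datetime(raw)
# dayfirst=True reads them as intended
day_first = pd.to_datetime(raw, dayfirst=True)
# an explicit format removes the ambiguity altogether
explicit = pd.to_datetime(raw, format="%d/%m/%Y")

# numeric version of the "should increase monotonically" check
print(month_first.is_monotonic_increasing)
print(day_first.is_monotonic_increasing)
```

If the rows are known to be in chronological order, `is_monotonic_increasing` gives a yes/no answer to the same question the plot answers visually.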

# Use the datetime as the index

In [None]:
df.set_index("DateTimeOfCall", inplace=True)

In [None]:
df

In [None]:
# can now use datetime to select rows: here is jan 2021
df.loc["2021-01-01":"2021-01-31", "FinalDescription"]

In [None]:
# can resample the timeseries: count calls per month
# (newer pandas versions prefer the alias "ME" for month-end)
df.resample("M")["IncidentNumber"].count().plot(title="Monthly Calls")
plt.show()

In [None]:
# or by day
df.resample("D")["IncidentNumber"].count().plot(title="Daily Calls")
plt.show()
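
To see what selecting by date and resampling do, it can help to try them on a tiny table first. This is a sketch with invented timestamps and incident numbers. Only the column name `IncidentNumber` comes from the dataset.

```python
import pandas as pd

# hypothetical toy data with a datetime index
toy = pd.DataFrame(
    {"IncidentNumber": ["a", "b", "c", "d"]},
    index=pd.to_datetime(
        [
            "2021-01-01 09:00",
            "2021-01-01 17:30",
            "2021-01-03 12:00",
            "2021-02-01 08:00",
        ]
    ),
)

# a partial date string selects every row in that month
january = toy.loc["2021-01"]

# resampling by day counts the rows in each calendar day,
# including days with no rows at all
daily = toy.loc["2021-01-01":"2021-01-03"].resample("D")["IncidentNumber"].count()
print(daily)
```

The day with no calls shows up with a count of zero, which is why a resampled plot has a point for every period rather than only for periods with data.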

# Missing data

Different strategies for dealing with missing data:

- Ignore the issue
  - some things may break / not work as expected
- Remove rows/columns with missing data
  - remove all rows with missing data: `df.dropna(axis=0)`
  - remove all columns with missing data: `df.dropna(axis=1)`
- Guess (impute) missing data
  - replace all missing entries with a value: `df.fillna(1)`
  - replace missing entries with mean for that column: `df.fillna(df.mean(numeric_only=True))`
  - replace each missing entry with previous valid entry: `df.ffill()`
  - replace missing by interpolating between valid entries: `df.interpolate()`
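
The strategies listed above can be compared side by side on a small table. This is a sketch on made-up numbers. The column names `a`, `b` and `c` are placeholders and have nothing to do with the dataset.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: columns a and b each have one missing entry
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [4.0, 5.0, np.nan],
        "c": [7.0, 8.0, 9.0],
    }
)

dropped_rows = toy.dropna(axis=0)  # keeps only rows without any missing entry
dropped_cols = toy.dropna(axis=1)  # keeps only columns without any missing entry
filled_const = toy.fillna(1)  # every missing entry becomes 1
filled_mean = toy.fillna(toy.mean())  # missing entry becomes its column mean
filled_prev = toy.ffill()  # missing entry copies the previous valid entry
interpolated = toy.interpolate()  # linear interpolation between valid entries

print(dropped_rows)
print(dropped_cols)
print(filled_mean)
```

Each call returns a new DataFrame and leaves `toy` unchanged, so the results can be inspected before deciding which one to keep.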

In [None]:
# count missing entries for each column
df.isna().sum()

In [None]:
# If PumpCount is missing, typically so is PumpHoursTotal
# 55 rows are missing at least one of these
pump_missing = df["PumpCount"].isna() | df["PumpHoursTotal"].isna()
print(pump_missing.sum())

In [None]:
# so we could choose to drop these rows
# (a boolean mask keeps only the rows where neither value is missing)
df1 = df[~pump_missing]
# here we made a new dataset df1 with these rows dropped
# to drop the rows from the original dataset df, could do:
#
# df = df[~pump_missing]
#
# or:
#
# df.dropna(subset=["PumpCount", "PumpHoursTotal"], inplace=True)
#
print(len(df1))

In [None]:
# another equivalent way to do this
df2 = df.dropna(subset=["PumpCount", "PumpHoursTotal"])
print(len(df2))
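
One caveat when removing rows: `drop` works on index labels, so if the index contains repeated labels (which a timestamp index can), it removes every row sharing a label. A boolean mask and `dropna` remove only the rows that are actually missing data. The sketch below uses invented values with a deliberately repeated timestamp. Whether the real dataset has repeated timestamps is not checked here.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: two rows share the same timestamp
toy = pd.DataFrame(
    {"PumpCount": [1.0, np.nan, 2.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
)
missing = toy["PumpCount"].isna()

# dropping by label also removes the valid row with the same timestamp
by_label = toy.drop(toy.loc[missing].index)
# a boolean mask removes only the flagged row
by_mask = toy[~missing]
# dropna gives the same result as the mask
by_dropna = toy.dropna(subset=["PumpCount"])

print(len(by_label), len(by_mask), len(by_dropna))
```

Comparing the lengths of the results, as in the cells above, is a quick way to notice this kind of difference.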

In [None]:
# but if we drop them, we lose valid data from other columns
# let's look at the distribution of values:
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
df.plot.hist(y="PumpCount", ax=axs[0])
df.plot.hist(y="PumpHoursTotal", ax=axs[1])
plt.show()

In [None]:
# looks like it would be better to replace missing PumpCount and PumpHoursTotal fields with 1
?df.fillna
df.fillna({"PumpCount": 1, "PumpHoursTotal": 1}, inplace=True)

In [None]:
df.isna().sum()

# Count the unique entries in each column

In [None]:
df.nunique().sort_values()

In [None]:
# this column is always the same:
df["TypeOfIncident"].unique()

In [None]:
# so not interesting - we can drop it from our dataframe:
df.drop("TypeOfIncident", axis=1, inplace=True)

In [None]:
# "cat" and "Cat" are treated as different animals here:
df["AnimalGroupParent"].unique()

In [None]:
# select rows where AnimalGroupParent is "cat", replace with "Cat"
df.loc[df["AnimalGroupParent"] == "cat", "AnimalGroupParent"] = "Cat"

In [None]:
df["AnimalGroupParent"].unique()
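
If many labels differed only in capitalisation, fixing them one by one would get tedious. Here is a sketch of two options on made-up labels. `str.capitalize` is one possible alternative, not something the analysis above relies on.

```python
import pandas as pd

# hypothetical toy data with inconsistent capitalisation
toy = pd.DataFrame({"AnimalGroupParent": ["Cat", "cat", "Dog", "cat"]})

# option 1: fix one known variant with a boolean mask
fixed = toy.copy()
fixed.loc[fixed["AnimalGroupParent"] == "cat", "AnimalGroupParent"] = "Cat"

# option 2: normalise the capitalisation of every entry in one go
normalised = toy["AnimalGroupParent"].str.capitalize()

print(fixed["AnimalGroupParent"].unique())
print(normalised.unique())
```

Note that `str.capitalize` also lower-cases every character after the first, which may not be wanted for labels made of several capitalised words.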

In [None]:
df.groupby("AnimalGroupParent")["IncidentNumber"].count().sort_values().plot.barh(
    logx=True
)
plt.show()
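
Counting rows per group is common enough that pandas offers a shortcut. This sketch, on invented rows, compares the `groupby` approach with `value_counts`.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame(
    {
        "IncidentNumber": ["001", "002", "003", "004"],
        "AnimalGroupParent": ["Cat", "Dog", "Cat", "Cat"],
    }
)

# count incidents per animal, smallest group first
by_group = toy.groupby("AnimalGroupParent")["IncidentNumber"].count().sort_values()
# the same counts taken directly from the column
by_value = toy["AnimalGroupParent"].value_counts(ascending=True)

print(by_group.to_dict())
print(by_value.to_dict())
```

The two can differ when data is missing: the `groupby` version counts non-missing `IncidentNumber` entries in each group, while `value_counts` counts non-missing `AnimalGroupParent` entries.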

In [None]:
df["SpecialServiceType"].unique()

In [None]:
# apparently different hourly costs
# does it depend on the type of event? or does it just increase over time?
df["HourlyNotionalCost(£)"].unique()

In [None]:
# just goes up over time
df["HourlyNotionalCost(£)"].plot.line()

In [None]:
df.groupby("StnGroundName")["IncidentNumber"].count()

## Plot location of calls on a map

In [None]:
# convert to geodataframe using geopandas
import geopandas

# drop missing longitude/latitude
df2 = df.dropna(subset=["Longitude", "Latitude"])
# also drop zero values
df2 = df2[df2["Latitude"] != 0]

# set crs to EPSG:4326 to specify WGS84 Latitude/Longitude
gdf = geopandas.GeoDataFrame(
    df2,
    geometry=geopandas.points_from_xy(df2["Longitude"], df2["Latitude"]),
    crs="EPSG:4326",
)
gdf.head()

In [None]:
import contextily as cx

f, ax = plt.subplots(figsize=(16, 16))
# plot location of calls involving animals
gdf[gdf["AnimalGroupParent"] == "Cow"].plot(
    ax=ax, color="black", alpha=0.3, label="Cow"
)
gdf[gdf["AnimalGroupParent"] == "Deer"].plot(
    ax=ax, color="red", alpha=0.3, label="Deer"
)
gdf[gdf["AnimalGroupParent"] == "Fox"].plot(ax=ax, color="blue", alpha=0.3, label="Fox")
# add a basemap of the region using contextily
cx.add_basemap(ax, crs=gdf.crs)
plt.title("Call locations by animal")
plt.legend()
plt.axis("off")
plt.show()