# DATA CLEANING + EDA (Pandas cont.)

Created By: Angelica Rojas

In [None]:
import pandas as pd
import re

## Upload Data

The data for this notebook can be found at this link: https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Calls-for-Service/k2nh-s5h5

For the purpose of this lesson we will use the CSV file of the data.

In [None]:
df = pd.read_csv("PATH_NAME")

#what does this do?
df.head()

Why did we only want to display the first 5 rows of the dataframe?

What if we wanted to see the size of this dataframe?
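Here is a minimal sketch of two common ways to check size, run on a small made-up dataframe (the name `toy` and its values are assumptions for illustration, not the BPD data). `len()` returns the number of rows, and `.shape` returns a `(rows, columns)` tuple.

```python
import pandas as pd

# Toy dataframe standing in for the real data (values are invented)
toy = pd.DataFrame({"id": [1, 2, 3], "offense": ["A", "B", "A"]})

n_rows = len(toy)          # number of rows
n_rows_cols = toy.shape    # tuple of (rows, columns)
print(n_rows, n_rows_cols)
```

Try the same two calls on `df` in the cells below.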

In [None]:
# number of rows

"CHECK # ROWS OF DF"

In [None]:
# shape of df (rows, columns)

"CHECK SHAPE OF DF"

# Part 1: DATA CLEANING

## Column Names

What do all these column names even mean? 

On that same website, BPD offers a narrative PDF file that describes the data they provided: https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Calls-for-Service/k2nh-s5h5


<img src = "DF_col_desc.png">

## Change Column Name(s)

Why would we want to change the column names?

In [None]:
df = df.rename(columns={'ORIG_COL_NAME': 'NEW_COL_NAME'})
df.head()

## Investigating Columns

What is the difference between the "Block_Location" and "BLKADDR" columns in the dataframe? In the displayed DF they look almost the same.

In [None]:
#Let's look at the first value in "Block_Location"
df["Block_Location"][0]

In [None]:
#Let's look at the first value in "BLKADDR"
df["BLKADDR"][0]

## Create New Columns

What new information in "Block_Location" can we actually use and save?

Let's create new columns for the information we extracted from those values.

In [None]:
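# Split each value on parentheses and keep the last non-empty piece,
# with all whitespace removed: the "lat,long" text
coordinates = [["".join(x.split()) for x in re.split(r'[()]',i) if x.strip()][-1] for i in df["Block_Location"]]

#new values: latitude is the piece before the comma, longitude the piece after
longitude = [["".join(x.split()) for x in re.split(r'[,]',i) if x.strip()][-1] for i in coordinates]
latitude = [["".join(x.split()) for x in re.split(r'[,]',i) if x.strip()][0] for i in coordinates]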

#create new columns for latitude and longitude
...

#Check if it worked
df.head()
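The list comprehensions above can be hard to read. As an alternative, here is a hedged sketch that uses the pandas string method `.str.extract` with a regular expression. It assumes each "Block_Location" value ends with coordinates in a `(lat, long)` layout; the addresses and the `toy` dataframe are invented for illustration.

```python
import pandas as pd

# Hypothetical values in the assumed "address ... (lat, long)" layout
toy = pd.DataFrame({
    "Block_Location": [
        "100 EXAMPLE ST\nBerkeley, CA\n(37.869058, -122.270455)",
        "200 SAMPLE AVE\nBerkeley, CA\n(37.871000, -122.268000)",
    ]
})

# Capture the two comma-separated numbers inside the parentheses
coords = toy["Block_Location"].str.extract(r"\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)")

# Each capture group becomes a column (0 and 1) of strings
toy["LATITUDE"] = coords[0]
toy["LONGITUDE"] = coords[1]
print(toy[["LATITUDE", "LONGITUDE"]])
```

The extracted values are strings, not numbers. If you need to do arithmetic with them, `pd.to_numeric` can convert a column.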



## Drop Columns

We got all the information we needed from "Block_Location", so keeping it would just take up extra room in our dataframe.

Let's drop the "Block_Location" from the dataframe.

In [None]:
df = df.drop("COL_NAME", axis = 1)

#Check if it dropped
df.head()

We can drop other columns that we do not think would add useful information to our analysis. 

Although we did get this data from the Berkeley PD, let's make sure all values in "City" are "Berkeley". Also, let's make sure the "State" is "CA" for all values.

In [None]:
df.City.unique()

In [None]:
df.State.unique()

We checked the unique values in the "State" and "City" columns and they are the results we wanted. Since every row has the same value, we do not need those columns anymore.

Drop the columns listed above.

In [None]:
#drop City and State columns
df = ...

#Check if they dropped
df.head()
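The "check, then drop" idea can be sketched on a toy dataframe (the `toy` name and its values are made up). This version only drops a column after confirming it holds a single unique value, and it passes a list to `drop` so that several columns go at once.

```python
import pandas as pd

toy = pd.DataFrame({
    "OFFENSE": ["A", "B"],
    "City": ["Berkeley", "Berkeley"],
    "State": ["CA", "CA"],
})

# Only drop a column once we have confirmed it holds a single value
constant_cols = [c for c in ["City", "State"] if toy[c].nunique() == 1]
toy = toy.drop(columns=constant_cols)
print(toy.columns.tolist())
```

`drop(columns=[...])` does the same thing as `drop([...], axis=1)`; use whichever reads better to you.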

## Dealing With Null Values (NaN)

<img src = "null_def.png">

REFERENCE: https://pandas.pydata.org/pandas-docs/stable/missing_data.html

This is a big data set and we can't look through each value one at a time. How can we make sure that there is a value for each category?

In [None]:
df.isnull().sum()
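Why does that work? `isnull()` returns a dataframe of True/False values, and `sum()` counts each True as 1, so we get one null count per column. A small sketch on invented data (the `toy` dataframe is an assumption for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "BLKADDR": ["100 EXAMPLE ST", np.nan, np.nan],
    "OFFENSE": ["A", "B", None],
})

null_counts = toy.isnull().sum()   # one count per column
print(null_counts)
```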

Let's look at the rows where "BLKADDR" is a null value. Let's make a temporary sub-dataframe.

In [None]:
null_temp = df[pd.isnull(df['BLKADDR'])]
null_temp

Does the number of rows in this dataframe match the null count for "BLKADDR" above?

In [None]:
#get number of rows of new df

...

Investigate the dataframe. Do you see something interesting that all these rows share?

Are the Latitude/Longitude values all the same for the "NaN" values?

In [None]:
#get unique values of latitude

...

In [None]:
#get unique values of longitude

...

## Boolean Slicing

Let's look at the whole dataset to see if there are any rows with that Latitude and Longitude combination that might have a "BLKADDR" associated with it.  

In [None]:
df[(df["LATITUDE"] == '37.869058') & (df["LONGITUDE"] == '-122.270455')]
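To see what is happening inside the brackets, here is a sketch on a toy dataframe (all rows are invented). Each comparison produces a True/False Series, `&` combines them row by row, and indexing with the result keeps only the True rows. The parentheses around each comparison are required because `&` binds more tightly than `==`.

```python
import pandas as pd

toy = pd.DataFrame({
    "BLKADDR": ["100 EXAMPLE ST", None, "300 DEMO WAY"],
    "LATITUDE": ["37.869058", "37.869058", "37.871000"],
    "LONGITUDE": ["-122.270455", "-122.270455", "-122.268000"],
})

# Each comparison yields a boolean Series; & keeps rows where both are True
mask = (toy["LATITUDE"] == "37.869058") & (toy["LONGITUDE"] == "-122.270455")
matches = toy[mask]
print(matches)
```

The coordinates are compared as strings here because that is how they were stored when we created the columns.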

In [None]:
#get unique values of BLKADDR for the Lat/Long combo

...

## Drop Null Values (NaN)

We could go to Google and try to figure out the BLKADDR ourselves, but to avoid any mistakes we might make while searching, let's just drop all the rows that include null values.

In [None]:
#drop rows that have null values
df = df.dropna(axis = 0, how = "any")
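It is worth knowing how many rows a drop removes. The sketch below (toy data, invented values) counts rows before and after. It also shows that `dropna` keeps the original index labels, so the first remaining row may no longer be labeled 0; `reset_index(drop=True)` renumbers the rows if you want that.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "BLKADDR": [np.nan, "200 SAMPLE AVE", "300 DEMO WAY"],
    "OFFENSE": ["A", "B", np.nan],
})

rows_before = len(toy)
cleaned = toy.dropna(axis=0, how="any")   # drop a row if any column is NaN
rows_dropped = rows_before - len(cleaned)
print(rows_dropped, cleaned.index.tolist())

# Optional: renumber the remaining rows starting from 0
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```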

In [None]:
#now find out how many null values within the df
#What should you see when you run this?
...

# <font color = "red"> YOUR TURN! </font>

### What is the difference between "EVENTDT" and "EVENTTM"? How can we clean our columns to reflect the data that is useful?

HINT: Focus on EVENTDT

In [None]:
#Slice the string to get the information you want and set to the variable
...

#Replace "EVENTDT" with new variable
...

#check if it worked
df.head()

### Drop the "InDbDate" column

In [None]:
...

#Check that it actually dropped
df.head()

### TRICKY QUESTION

### The numbers in column "D.O.W." can be a bit confusing. Replace each number with the day it corresponds to.

You can find the days it corresponds to in the beginning of the notebook. 

##### HINT: You may need to use a dictionary, the map function, or the zip function

DICT:
https://www.programiz.com/python-programming/methods/built-in/dict

MAP:
https://www.programiz.com/python-programming/methods/built-in/map

ZIP:
https://www.programiz.com/python-programming/methods/built-in/zip

In [None]:
# dow = {0:"Sunday",1:"Monday",2:"Tuesday",3:"Wednesday", 4:"Thursday", 5:"Friday", 6:"Saturday"}
# df["D.O.W."] = df["D.O.W."].map(dow)

#Check if it worked
df.head()

# Part 2: EXPLORATORY DATA ANALYSIS

<h3>"Exploratory data analysis or 'EDA' is a <b>critical</b> beginning step in analyzing the data from an experiment.</h3>

<b>Here are the main reasons we use EDA:</b>
<ul>
• detection of mistakes<br><br>
• checking of assumptions<br><br>
• preliminary selection of appropriate models<br><br>
• determining relationships among the explanatory variables, and<br><br>
• assessing the direction and rough size of relationships between explanatory and outcome variables."</ul>
REFERENCE: http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf


## Now what?

We have cleaned our data to the best of our ability based on the initial look. Now let's try to look at the <b>relationships</b> between different values. 

In [None]:
df.head()

Let's look at the different types of offenses that were called in. We know that using the .unique() function will return all the unique values in the column, but what if we also wanted to <b>count</b> how many times each unique value appears?

In [None]:
df.OFFENSE.value_counts()

In [None]:
df.EVENTDESC.value_counts()
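A quick sketch of the difference between `.unique()` and `.value_counts()`, using a made-up Series (the labels are invented, not taken from the BPD data). `value_counts()` sorts from most to least frequent by default, and `normalize=True` returns proportions instead of counts.

```python
import pandas as pd

offenses = pd.Series(["THEFT", "ASSAULT", "THEFT", "THEFT", "ASSAULT", "FRAUD"])

print(offenses.unique())                        # distinct values only
counts = offenses.value_counts()                # occurrences, most frequent first
shares = offenses.value_counts(normalize=True)  # proportions instead of counts
print(counts)
print(shares)
```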

Why does "LARCENY" occur most often in the "EVENTDESC" column when "BURGLARY - VEHICLE" came first in the "OFFENSE" column? Let's look into this a little more.


## GroupBy 

In [None]:
df.groupby("EVENTDESC").OFFENSE.value_counts()

#turn the series into a DF 
df.groupby("EVENTDESC").OFFENSE.value_counts().to_frame()
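To make the grouped counts easier to reason about, here is a sketch on toy data (category and offense labels are invented). It shows how one category can have the largest total even though the single most frequent offense belongs to a different category.

```python
import pandas as pd

toy = pd.DataFrame({
    "EVENTDESC": ["LARCENY"] * 4 + ["BURGLARY"] * 3,
    "OFFENSE": ["THEFT A", "THEFT A", "THEFT B", "THEFT B",
                "BURGLARY X", "BURGLARY X", "BURGLARY X"],
})

# Count offenses inside each category (result has a two-level index)
per_offense = toy.groupby("EVENTDESC")["OFFENSE"].value_counts()

# Total calls per category, for comparison with the per-offense counts
per_category = toy.groupby("EVENTDESC").size()

print(per_offense)
print(per_category)
print(per_offense.idxmax(), per_category.idxmax())
```

Here "LARCENY" wins as a category because its two offenses add up, while "BURGLARY X" is the largest single offense.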

From this DF we can see that "LARCENY" covers several OFFENSE values. When you add the totals for "THEFT MISD ( UNDER $ 950)", "THEFT  FELONY (OVER $ 950)", and "THEFT FROM PERSON", the sum is larger than the total for "BURGLARY - VEHICLE", even though "BURGLARY AUTO" on its own is the most frequent single offense.

# <font color = "red"> YOUR TURN! </font>

Could there be any relationship between the Day of the Week and the calls? Try out different functions to see if there is any significance.

In [None]:
#count the number of calls per day
...

With the day that has the most calls, check the type of offense that appears the most.

In [None]:
#only display rows with the D.O.W that appears the most
#create a temp df
...

In [None]:
#count the number of offenses by type
...

In [None]:
#count the number of eventdesc by type
...

Do these numbers match the results of the overall DF?


Let's try something else. Friday and Saturday nights are typically associated with being the "party" time. If this is true, should there be more Liquor/Drug/Disorderly Conduct/etc. occurrences on those nights?

Let's try it with Fridays!

In [None]:
#only display rows with the D.O.W of Friday
#create a temp df

fri = df[df["D.O.W."] == "Friday"]
fri.head()

In [None]:
#count the number of offenses by type
...

In [None]:
#count the number of events by type
...

This will not give us the information we want. Instead let us look at each EVENTDESC and group by the D.O.W. that appears the most per EVENTDESC.

Also, I realized that the periods in "D.O.W." are becoming a problem: attribute-style access such as df.OFFENSE cannot be written for a column name that contains periods. I want to change the name of the column again. How can I do that?

In [None]:
#change column name
df = ...
df.head()
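Here is a sketch of the problem and one possible fix on a toy dataframe. The new name "DOW" is a hypothetical choice for illustration; pick whatever name you like for your own dataframe.

```python
import pandas as pd

toy = pd.DataFrame({"D.O.W.": ["Friday", "Monday"], "OFFENSE": ["A", "B"]})

# toy.D.O.W. is not valid Python, so only bracket access works here
print(toy["D.O.W."].tolist())

# "DOW" is a hypothetical replacement name without periods
toy = toy.rename(columns={"D.O.W.": "DOW"})
print(toy.DOW.tolist())   # attribute access now works
```

Bracket access like `toy["DOW"]` works for any column name, so it is the safer habit even after renaming.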

# <font color = "red"> GROUP WORK</font> 
## What do YOU want to find out? YOUR DATA INVESTIGATION

In this notebook you have been learning all these techniques to be able to manipulate your dataframe to your preference. We know how to clean and explore our data, but what questions or topics did you actually want to learn from the data? 

<b> * In groups of 2-4 people, investigate the dataframe in this notebook and pick a question/topic to answer. Using the techniques you learned today, show relationships and results that would support that question/topic. 
</b><br><br>
<i>If we have time</i> <b>each</b> group will present their investigations and why they are significant to the class.
