# Processing data with pandas


```{attention}

The purpose of these tutorials is only to give you some hints for completing the assignment tasks, not to teach Python programming step by step.
<br/>
As mentioned in the first lecture of the course, this course requires basic knowledge of the Python programming language.

If you are not familiar with it, try to catch up on the basics as fast as you can. Some useful resources are: <br/>

[pandas online documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)

<br/>

[Python for Data Analysis (3rd Edition, 2022)](https://wesmckinney.com/book/)

```


Assignment 1 is about learning how to read and explore data files in Python. We will focus on using [pandas](https://pandas.pydata.org/pandas-docs/stable/), an open-source package for data analysis in Python. pandas is an excellent toolkit for working with *real world data* that often has a tabular structure (rows and columns).



## Input data: Community Crime Statistics Map

Our input data in this tutorial is a text file containing the Community Crime Statistics Map for the city of Calgary, Alberta, Canada, retrieved from the [City of Calgary Open Data Portal](https://data.calgary.ca/Health-and-Safety/Community-Crime-Statistics-Map/n24v-9r86):

- File name: `Community_Crime_Statistics_20240120.csv` (you can have a look at the file before reading it in using pandas!)
- You can download the data from the link provided: [City of Calgary Open Data Portal](https://data.calgary.ca/Health-and-Safety/Community-Crime-Statistics-Map/n24v-9r86)
- Data is provided monthly by the Calgary Police Service and includes the crime location, time, category, crime count, and resident count.
- In total, the dataset has 67,262 rows and 10 columns.


## Loading Data

Next, we will read the input data file and store its contents in a variable called `data` using the `pandas.read_csv()` function:

```python
# Import the pandas library
import pandas as pd

# Read the CSV file into a DataFrame
data = pd.read_csv("Community_Crime_Statistics_20240120.csv", sep=',')
```

```{admonition} Reading different file formats
Check out the [pandas documentation about input and output functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5) and [Chapter 6](https://wesmckinney.com/book/accessing-data) in McKinney (2022) for more details about reading data.
```
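As a rough illustration of how `pd.read_csv()` turns delimited text into a DataFrame, the sketch below reads a tiny in-memory CSV through `io.StringIO` instead of a file on disk. The three rows are a hand-typed excerpt used only for illustration, and the names `sample_csv`, `sample`, and `subset` are assumptions made for this example.

```python
import io

import pandas as pd

# A tiny hand-typed CSV excerpt standing in for the real file (illustrative only)
sample_csv = io.StringIO(
    "Sector,Community Name,Category,Crime Count,Resident Count\n"
    "NORTHWEST,ARBOUR LAKE,Theft OF Vehicle,2,10619.0\n"
    "CENTRE,BANFF TRAIL,Theft OF Vehicle,2,4153.0\n"
    "EAST,DOVER,Theft OF Vehicle,5,10351.0\n"
)

# sep="," is the default delimiter, so it could be omitted here
sample = pd.read_csv(sample_csv, sep=",")

# Rewind the buffer, then read only two columns with usecols
sample_csv.seek(0)
subset = pd.read_csv(sample_csv, usecols=["Community Name", "Crime Count"])

print(sample.shape)
print(list(subset.columns))
```

Reading only the columns you need with `usecols` can be handy for wide files, although the crime dataset here is small enough to load in full.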

Let’s now print the dataframe and see what it looks like. 

We can use the `head()` method of the pandas DataFrame object to quickly check the first rows (five by default). We can also check the last rows of the data using `data.tail()`.

```python
# Print the first 5 rows of data
data.head()
```

| | Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | ID | Community Center Point |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619.0 | 2022/04 | 2022 | APR | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153.0 | 2023/10 | 2023 | OCT | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351.0 | 2022/12 | 2022 | DEC | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906.0 | 2020/12 | 2020 | DEC | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382.0 | 2019/08 | 2019 | AUG | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) |
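Both `head()` and `tail()` accept an optional number of rows. The sketch below uses a small made-up DataFrame named `demo` (an assumption for this example, not the crime dataset) to show how the row count changes what is returned.

```python
import pandas as pd

# Hypothetical stand-in for `data`: ten rows with a single numeric column
demo = pd.DataFrame({"Crime Count": range(1, 11)})

first_three = demo.head(3)   # first 3 rows
last_two = demo.tail(2)      # last 2 rows
default_head = demo.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
```

The same calls should work on `data`, for example `data.tail(10)` to inspect the last ten records.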


Let's see some basic info about the data (the number of columns and rows, the data type of each column, etc.):

```python
# Check the data info
data.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67262 entries, 0 to 67261
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Sector                  67231 non-null  object 
 1   Community Name          67262 non-null  object 
 2   Category                67262 non-null  object 
 3   Crime Count             67262 non-null  int64  
 4   Resident Count          67193 non-null  float64
 5   Date                    67262 non-null  object 
 6   Year                    67262 non-null  int64  
 7   Month                   67262 non-null  object 
 8   ID                      67262 non-null  object 
 9   Community Center Point  67231 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 5.1+ MB
```
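The non-null counts reported by `info()` hint at missing values: `Sector` has 67231 non-null values out of 67262 entries, so 31 are missing, and `Resident Count` is missing 69. A minimal sketch of how such gaps could be counted directly is shown below, using a small made-up DataFrame named `demo` (an assumption for this example) rather than the real data.

```python
import pandas as pd

# Hypothetical three-row frame with one missing Sector and one missing Resident Count
demo = pd.DataFrame(
    {
        "Sector": ["NORTHWEST", None, "EAST"],
        "Crime Count": [2, 2, 5],
        "Resident Count": [10619.0, float("nan"), 10351.0],
    }
)

# shape is a (rows, columns) tuple
n_rows, n_cols = demo.shape

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = demo.isna().sum()

print(demo.dtypes)
print(missing_per_column)
```

Applied to the crime data, `data.isna().sum()` would be expected to report the same gaps that the non-null counts imply.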


We can also compute some basic statistics for the columns that contain numbers (`Crime Count`, `Resident Count`, `Year`):

```python
# Check the data description
data.describe()
```

| | Crime Count | Resident Count | Year |
|---|---|---|---|
| count | 67262.0 | 67193.0 | 67262.0 |
| mean | 2.879932 | 6498.019883 | 2020.460379 |
| std | 3.681941 | 5456.475655 | 1.713162 |
| min | 1.0 | 0.0 | 2018.0 |
| 25% | 1.0 | 2263.0 | 2019.0 |
| 50% | 2.0 | 5957.0 | 2020.0 |
| 75% | 3.0 | 9244.0 | 2022.0 |
| max | 110.0 | 25710.0 | 2023.0 |
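By default, `describe()` on a DataFrame summarizes only the numeric columns. As a hedged sketch, the example below shows that calling `describe()` on a single text column gives a different summary (count, unique, top, freq), and that individual statistics can be computed one at a time. The DataFrame `demo` and its four rows are made up for illustration.

```python
import pandas as pd

# Hypothetical four-row frame mixing a text column and a numeric column
demo = pd.DataFrame(
    {
        "Category": [
            "Theft OF Vehicle",
            "Theft OF Vehicle",
            "Theft FROM Vehicle",
            "Assault (Non-domestic)",
        ],
        "Crime Count": [2, 2, 3, 5],
    }
)

# Numeric summary: only "Crime Count" is included by default
numeric_summary = demo.describe()

# Summary of a text column: count, unique, top (most frequent value), freq
category_summary = demo["Category"].describe()

# Individual statistics for one column
mean_count = demo["Crime Count"].mean()
max_count = demo["Crime Count"].max()

print(numeric_summary)
print(category_summary)
```

On the crime data, `data["Category"].describe()` would show, under the same assumptions, how many distinct crime categories exist and which one appears most often.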
