Formatting text in Colaboratory: A guide to Colaboratory markdown
===

## What is markdown?

Colaboratory has two types of cells: text and code. The text cells are formatted using a simple markup language called markdown, based on [the original](https://daringfireball.net/projects/markdown/syntax).

Some markdown examples: **bold text**

\**italics*\* and \__italics_\_

**\*\*bold\*\***

\~\~~~strikethrough~~\~\~


No indent
>\>One level of indentation
>>\>\>Two levels of indentation

An ordered list:
1. 1\. One
1. 1\. Two
1. 1\. Three

An unordered list:
* \* One
* \* Two
* \* Three

In [1]:
# this is a comment - will not execute if you run this cell
# comments are used to explain what a certain snippet of code is for
# to add a comment, just type a "#" before any text

## In this exercise, we will use pandas to load a CSV file

In [3]:
# pandas is a library in Python commonly used for data analysis and data manipulation

import pandas as pd

## In this exercise, let's upload the DataSeerGrabPrizeData file from the case study

In [5]:
# in order to upload a file from your computer to this Colab notebook, we will be using a module from google.colab called "files"
# this upload might take a while because of the size of the dataset (265k+ rows)

from google.colab import files
files.upload()

Saving DataSeerGrabPrizeData.csv to DataSeerGrabPrizeData.csv


{'DataSeerGrabPrizeData.csv': b'source,created_at_local,pick_up_latitude,pick_up_longitude,drop_off_latitude,drop_off_longitude,city,fare,pick_up_distance,state\r\nADR,2013-09-22 23:46:18,14.604348,120.998654,14.53737,120.994423,Metro Manila,281.875,0.389894,CANCELLED\r\nT47,2013-11-04 3:51:59,14.590099,121.082645,14.508611,121.019444,Metro Manila,413.125,2.20977,COMPLETED\r\nT47,2013-11-21 5:21:24,14.582707,121.061458,14.537752,121.001379,Metro Manila,277.5,2.70291,COMPLETED\r\nADR,2013-09-16 20:53:34,14.585812,121.060171,14.575915,121.085487,Metro Manila,220.625,0.321403,CANCELLED\r\nIOS,2013-09-10 23:49:16,14.55201,121.05126,14.63021,120.99592,Metro Manila,378.125,0.667067,COMPLETED\r\nT47,2013-11-12 21:26:18,14.589394,121.059928,14.444546,120.993874,Metro Manila,505,0.289595,COMPLETED\r\nIOS,2013-11-16 18:27:06,14.58645,121.04887,14.63925,121.03681,Metro Manila,229.375,1.54755,COMPLETED\r\nT47,2013-09-21 9:47:03,14.588782,121.097317,14.583526,121.05698,Metro Manila,211.875,2.91636,
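The output above shows that `files.upload()` returns a dictionary mapping each file name to its contents as bytes. As a minimal sketch, those bytes can be handed straight to pandas through `io.BytesIO`. The `uploaded` dictionary below is a hand-built stand-in (an assumption, holding only the header and the first two rows shown above) so that the snippet also runs outside Colab.

```python
import io
import pandas as pd

# Stand-in for the dictionary returned by files.upload() (assumption:
# only the header and the first two rows from the output above are included).
uploaded = {
    "DataSeerGrabPrizeData.csv": (
        b"source,created_at_local,pick_up_latitude,pick_up_longitude,"
        b"drop_off_latitude,drop_off_longitude,city,fare,pick_up_distance,state\r\n"
        b"ADR,2013-09-22 23:46:18,14.604348,120.998654,14.53737,120.994423,"
        b"Metro Manila,281.875,0.389894,CANCELLED\r\n"
        b"T47,2013-11-04 3:51:59,14.590099,121.082645,14.508611,121.019444,"
        b"Metro Manila,413.125,2.20977,COMPLETED\r\n"
    )
}

# Wrap the raw bytes in a file-like object and let pandas parse them.
raw_bytes = uploaded["DataSeerGrabPrizeData.csv"]
sample_df = pd.read_csv(io.BytesIO(raw_bytes))

print(sample_df.shape)
print(sample_df.columns.tolist())
```

In a real Colab session, `uploaded = files.upload()` would take the place of the hand-built dictionary.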

In [6]:
# we will use pandas to load the csv file we just uploaded into a dataframe

df = pd.read_csv('DataSeerGrabPrizeData.csv')

See the [pandas.read_csv documentation](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.read_csv.html).
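`read_csv` accepts many optional parameters. As an illustrative sketch (not part of the original notebook), the example below uses `parse_dates` so that `created_at_local` is loaded as a datetime column instead of plain text. It reads two rows copied from the upload output, trimmed to four columns for brevity; `csv_text` and `parsed_df` are names chosen here for illustration.

```python
import io
import pandas as pd

# Two rows from the uploaded file, reduced to four columns (assumption:
# the trimmed column selection is only for illustration).
csv_text = (
    "source,created_at_local,fare,state\n"
    "ADR,2013-09-22 23:46:18,281.875,CANCELLED\n"
    "ADR,2013-09-16 20:53:34,220.625,CANCELLED\n"
)

# parse_dates asks pandas to convert the listed columns to datetimes.
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["created_at_local"])

print(parsed_df.dtypes)

# With a datetime column, parts of the timestamp are available via .dt
hours = parsed_df["created_at_local"].dt.hour.tolist()
print(hours)
```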


In [7]:
# we will use describe() to calculate summary statistics for the numeric columns of the dataset we just loaded

df.describe()

| | pick_up_latitude | pick_up_longitude | drop_off_latitude | drop_off_longitude | fare | pick_up_distance |
|---|---|---|---|---|---|---|
| count | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 245.0 |
| mean | 14.56817 | 121.045472 | 14.570122 | 121.0328 | 206.747917 | 1.26506 |
| std | 0.105709 | 0.050816 | 0.051373 | 0.031681 | 127.126511 | 0.988334 |
| min | 12.879721 | 120.972865 | 14.417032 | 120.86712 | 0.0 | 0.031872 |
| 25% | 14.551893 | 121.024334 | 14.550883 | 121.017265 | 167.03125 | 0.484101 |
| 50% | 14.57216 | 121.0422 | 14.559313 | 121.02767 | 211.875 | 1.00964 |
| 75% | 14.589131 | 121.061 | 14.589908 | 121.054867 | 269.84375 | 1.78689 |
| max | 14.702179 | 121.774017 | 14.88148 | 121.12269 | 811.25 | 5.16029 |
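In the `describe()` output, the `count` row only counts non-missing values, which is why `pick_up_distance` shows 245 while the other columns show 300. One way to confirm this is to count the missing values per column with `isna().sum()`. The sketch below uses a small made-up dataframe (`toy`, an assumption for illustration), not the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four rides have no pick-up distance.
toy = pd.DataFrame({
    "fare": [281.875, 413.125, 277.5, 220.625],
    "pick_up_distance": [0.389894, np.nan, 2.70291, np.nan],
})

# describe() skips missing values, so "count" differs between columns.
summary = toy.describe()
print(summary.loc["count"])

# isna().sum() reports how many values are missing in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```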


In this exercise, we'll remove the rows with missing pick-up distances.

From inspection, all unallocated rides had no pick-up distance value (i.e. if the ride was unallocated, the passenger was not picked up).

In [8]:
# dropna() is a dataframe method that removes every row containing at least one missing value

df = df.dropna()
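Called with no arguments, `dropna()` drops a row if a value is missing in any column. If only `pick_up_distance` should decide which rows are removed, the `subset` parameter restricts the check to that column. A minimal sketch on made-up data (the `toy` dataframe and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is missing "city", row 2 is missing "pick_up_distance".
toy = pd.DataFrame({
    "city": ["Metro Manila", None, "Metro Manila"],
    "fare": [281.875, 413.125, 277.5],
    "pick_up_distance": [0.389894, 2.20977, np.nan],
})

# Default behaviour: a row is dropped if any column is missing.
dropped_any = toy.dropna()

# subset limits the check to the listed columns only.
dropped_subset = toy.dropna(subset=["pick_up_distance"])

print(len(toy), len(dropped_any), len(dropped_subset))
```

Either form returns a new dataframe, so the result has to be assigned back (as in `df = df.dropna()`) for the change to stick.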

In [9]:
# we'll run describe() again to check if rows with missing values are now dropped, i.e. if output shows reduced row count

df.describe()

| | pick_up_latitude | pick_up_longitude | drop_off_latitude | drop_off_longitude | fare | pick_up_distance |
|---|---|---|---|---|---|---|
| count | 245.0 | 245.0 | 245.0 | 245.0 | 245.0 | 245.0 |
| mean | 14.575107 | 121.041683 | 14.572584 | 121.03173 | 253.160714 | 1.26506 |
| std | 0.03672 | 0.027717 | 0.050846 | 0.032275 | 89.476141 | 0.988334 |
| min | 14.465721 | 120.9802 | 14.426363 | 120.86712 | 146.25 | 0.031872 |
| 25% | 14.55184 | 121.02388 | 14.551694 | 121.016413 | 190.0 | 0.484101 |
| 50% | 14.5721 | 121.03972 | 14.56071 | 121.02727 | 229.375 | 1.00964 |
| 75% | 14.58948 | 121.060081 | 14.594508 | 121.05358 | 295.0 | 1.78689 |
| max | 14.688088 | 121.12269 | 14.88148 | 121.12269 | 811.25 | 5.16029 |


In [10]:
# we'll use another function from pandas to write the dataframe to another csv file
# we pass index=False so that the row numbers (the dataframe index) are not written to the file

df.to_csv('new.csv', index=False)

In [11]:
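# for comparison, this version keeps the default index=True, so the row numbers are written as the first column
df.to_csv('with_row_numbers.csv')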
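The difference between the two files is easiest to see in the text pandas writes. When no path is given, `to_csv` returns the CSV as a string, which the sketch below uses on a small made-up dataframe (`toy` is an assumption for illustration).

```python
import io
import pandas as pd

# Made-up two-row dataframe standing in for the real data.
toy = pd.DataFrame({
    "fare": [281.875, 413.125],
    "state": ["CANCELLED", "COMPLETED"],
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
without_index = toy.to_csv(index=False)
with_index = toy.to_csv()

print(without_index)
print(with_index)

# Reading the indexed version back turns the row numbers into a regular
# column, which pandas names "Unnamed: 0" because its header is empty.
round_trip = pd.read_csv(io.StringIO(with_index))
print(round_trip.columns.tolist())
```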

In [12]:
files.download('new.csv')


In [13]:
files.download('with_row_numbers.csv')
