# Why Pandas?

* Simple to use 
* Integrated with many other data science & ML Python tools
* Helps you get your data ready for ML algorithms 

In [1]:
import pandas as pd

## 2 main data types

### Series

* A Series can be created from a Python list
* A Series is one-dimensional

In [2]:
series = pd.Series(["BMW", "Toyota", "Honda"])

series

0       BMW
1    Toyota
2     Honda
dtype: object

In [3]:
colours = pd.Series(["Red", "Blue", "White"])

colours

0      Red
1     Blue
2    White
dtype: object

### DataFrame

* A DataFrame is two-dimensional

In [4]:
# We can make a DataFrame from Series objects

car_data = pd.DataFrame({"Carmake": series, "Colour": colours})

car_data

,Carmake,Colour
0,BMW,Red
1,Toyota,Blue
2,Honda,White


### Import Data
Building a dataset from scratch is tedious, so we import the data from a CSV file instead.

In [5]:
car_sales = pd.read_csv("car-sales.csv")

car_sales

,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


### Export a DataFrame

In [6]:
car_sales.to_csv("exported-car-sales.csv")

In [7]:
exported_car_sales = pd.read_csv("exported-car-sales.csv")

exported_car_sales

,Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,0,Toyota,White,150043,4,"$4,000.00"
1,1,Honda,Red,87899,4,"$5,000.00"
2,2,Toyota,Blue,32549,3,"$7,000.00"
3,3,BMW,Black,11179,5,"$22,000.00"
4,4,Nissan,White,213095,4,"$3,500.00"
5,5,Toyota,Green,99213,4,"$4,500.00"
6,6,Honda,Blue,45698,4,"$7,500.00"
7,7,Honda,Blue,54738,4,"$7,000.00"
8,8,Toyota,White,60000,4,"$6,250.00"
9,9,Nissan,White,31600,4,"$9,700.00"


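Exporting with the default settings also writes the index to the file, so re-reading it produces a second index column. To avoid this double index, pass index=False to to_csv().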

In [8]:
car_sales.to_csv("exported-car-sales.csv", index=False)

In [9]:
exported_car_sales = pd.read_csv("exported-car-sales.csv")

exported_car_sales

,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"
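
If a CSV file has already been written with its index, another option is to tell read_csv() which column holds the index. The sketch below assumes a small three-row stand-in for the car sales data and uses an in-memory buffer instead of a real file, so it runs without car-sales.csv.

```python
import io

import pandas as pd

# Small stand-in for the car sales data (an assumption for this sketch).
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota"],
    "Colour": ["White", "Red", "Blue"],
    "Doors": [4, 4, 3],
})

# Write to an in-memory buffer, index included (the default behaviour).
buffer = io.StringIO()
car_sales.to_csv(buffer)
buffer.seek(0)

# index_col=0 uses the first CSV column as the index,
# so it does not come back as an extra data column.
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded.columns.tolist())
```

Which approach fits better depends on whether the index carries information worth keeping in the file.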


## Describe Data

In [10]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price            object
dtype: object

Here dtypes is an attribute that shows the data type of each column.

In [11]:
car_sales.columns

Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

In [12]:
car_columns = car_sales.columns

car_columns

Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

Here car_sales.columns returns the column names of the car_sales DataFrame as an Index object.

In [13]:
car_sales.index

RangeIndex(start=0, stop=10, step=1)

In [14]:
car_sales

,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


In [15]:
car_sales.describe()

,Odometer (KM),Doors
count,10.0,10.0
mean,78601.4,4.0
std,61983.471735,0.471405
min,11179.0,3.0
25%,35836.25,4.0
50%,57369.0,4.0
75%,96384.5,4.0
max,213095.0,5.0


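Here describe() is a method that shows summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the data.

N.B. By default, describe() only summarizes numerical columns, which is why Make, Colour and Price are missing from the output above.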
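
To include non-numerical columns in the summary, describe() accepts an include argument. A minimal sketch, assuming a four-row stand-in for the car sales data:

```python
import pandas as pd

# Four-row stand-in for the car sales data (an assumption for this sketch).
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW"],
    "Doors": [4, 4, 3, 5],
})

# Default: only the numeric column (Doors) is summarized.
numeric_summary = car_sales.describe()

# include="all" adds text columns, with rows such as unique, top and freq.
full_summary = car_sales.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.loc[["unique", "top", "freq"], "Make"])
```

Statistics that do not apply to a column (for example, the mean of Make) show up as NaN in the combined summary.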

In [16]:
car_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Make           10 non-null     object
 1   Colour         10 non-null     object
 2   Odometer (KM)  10 non-null     int64 
 3   Doors          10 non-null     int64 
 4   Price          10 non-null     object
dtypes: int64(2), object(3)
memory usage: 528.0+ bytes


In [24]:
car_sales["Doors"].mean()

4.0

In [20]:
car_prices = pd.Series([3000, 1500, 111250])
car_prices.mean()

38583.333333333336

In [21]:
car_sales.sum()

Make             ToyotaHondaToyotaBMWNissanToyotaHondaHondaToyo...
Colour               WhiteRedBlueBlackWhiteGreenBlueBlueWhiteWhite
Odometer (KM)                                               786014
Doors                                                           40
Price            $4,000.00$5,000.00$7,000.00$22,000.00$3,500.00...
dtype: object
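
In the output above, the text columns are concatenated rather than added up, which is rarely useful. One way to restrict the total to numerical columns is the numeric_only argument of sum(). The sketch below assumes a three-row stand-in for the car sales data.

```python
import pandas as pd

# Three-row stand-in for the car sales data (an assumption for this sketch).
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota"],
    "Odometer (KM)": [150043, 87899, 32549],
    "Doors": [4, 4, 3],
})

# numeric_only=True skips text columns such as Make.
numeric_totals = car_sales.sum(numeric_only=True)
print(numeric_totals)
```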

In [23]:
car_sales["Doors"].sum()

40

In [25]:
len(car_sales)

10

To learn more, see: https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html

## Viewing and selecting data

In [26]:
car_sales.head()

,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"


head() shows the top 5 rows of the DataFrame by default. To see more or fewer rows, pass the number you want, e.g. head(7) shows the first 7 rows.

In [27]:
car_sales.tail()

,Make,Colour,Odometer (KM),Doors,Price
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


tail() shows the bottom 5 rows of the DataFrame by default. To see more or fewer rows, pass the number you want, e.g. tail(7) shows the last 7 rows.
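
A short sketch of passing a row count to head() and tail(), using only the odometer readings from the car sales data as a stand-in DataFrame:

```python
import pandas as pd

# Odometer readings from the car sales data, used as a one-column stand-in.
odometer = pd.DataFrame({
    "Odometer (KM)": [150043, 87899, 32549, 11179, 213095,
                      99213, 45698, 54738, 60000, 31600],
})

first_seven = odometer.head(7)  # rows at positions 0 to 6
last_three = odometer.tail(3)   # rows at positions 7 to 9

print(len(first_seven), len(last_three))
```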

## .loc & .iloc

In [29]:
animals = pd.Series(["cat", "dog", "bird", "panda", "snake"], index=[0, 3, 9, 8, 3])

animals

0      cat
3      dog
9     bird
8    panda
3    snake
dtype: object

In [31]:
animals.loc[3]

3      dog
3    snake
dtype: object

In [32]:
animals.loc[9]

'bird'

In [33]:
animals.iloc[3]

'panda'

In [35]:
animals.iloc[2]

'bird'

Here .loc selects by index label and .iloc selects by integer position (location).
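
The difference is easiest to see when a number is a valid position but not a valid label. In the sketch below, which reuses the animals Series, 1 is a position but no row carries the label 1, so .iloc succeeds while .loc raises a KeyError.

```python
import pandas as pd

animals = pd.Series(
    ["cat", "dog", "bird", "panda", "snake"],
    index=[0, 3, 9, 8, 3],
)

# Position 1 is the second element, whatever its label is.
second_animal = animals.iloc[1]

# No row is labelled 1, so a label-based lookup fails.
try:
    animals.loc[1]
    label_found = True
except KeyError:
    label_found = False

print(second_animal, label_found)
```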

In [37]:
animals.iloc[:3]

0     cat
3     dog
9    bird
dtype: object

You can use slicing with .iloc. As with Python lists, the stop position is excluded.

In [38]:
car_sales.iloc[:3]

,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"


In [40]:
car_sales.loc[:3]

,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
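
The two slices return different numbers of rows because .iloc[:3] stops before position 3, whereas .loc[:3] slices by label and includes the label 3. A minimal sketch, assuming a five-row stand-in with the default integer index:

```python
import pandas as pd

# Five-row stand-in for the car sales data (an assumption for this sketch).
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan"],
})

by_position = car_sales.iloc[:3]  # positions 0, 1, 2 (stop excluded)
by_label = car_sales.loc[:3]      # labels 0, 1, 2, 3 (stop included)

print(len(by_position), len(by_label))
```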


In [41]:
car_sales.Make

0    Toyota
1     Honda
2    Toyota
3       BMW
4    Nissan
5    Toyota
6     Honda
7     Honda
8    Toyota
9    Nissan
Name: Make, dtype: object

In [42]:
car_sales["Make"]

0    Toyota
1     Honda
2    Toyota
3       BMW
4    Nissan
5    Toyota
6     Honda
7     Honda
8    Toyota
9    Nissan
Name: Make, dtype: object

In [43]:
car_sales.Odometer (KM)

AttributeError: 'DataFrame' object has no attribute 'Odometer'

N.B. If a column name contains a space, dot notation will not work. Use bracket notation instead.
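
Bracket notation works for any column name, including ones with spaces. A short sketch, assuming a three-row stand-in for the car sales data:

```python
import pandas as pd

# Three-row stand-in for the car sales data (an assumption for this sketch).
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota"],
    "Odometer (KM)": [150043, 87899, 32549],
})

# Bracket notation handles the space and parentheses in the name.
odometer = car_sales["Odometer (KM)"]

# For a simple name like Make, both notations return the same column.
same_column = car_sales.Make.equals(car_sales["Make"])

print(odometer.tolist(), same_column)
```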

## Boolean indexing

In [44]:
car_sales[car_sales["Make"] == "Toyota"]

,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
5,Toyota,Green,99213,4,"$4,500.00"
8,Toyota,White,60000,4,"$6,250.00"


In [48]:
car_sales[car_sales["Odometer (KM)"] > 100000]

,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
4,Nissan,White,213095,4,"$3,500.00"
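
The two filters above can also be combined. The sketch below, which assumes a five-row stand-in for the car sales data, builds each condition as a boolean Series and joins them with &, which keeps only the rows where both are True.

```python
import pandas as pd

# Five-row stand-in for the car sales data (an assumption for this sketch).
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan"],
    "Odometer (KM)": [150043, 87899, 32549, 11179, 213095],
})

# Each comparison returns a boolean Series with one value per row.
is_toyota = car_sales["Make"] == "Toyota"
high_mileage = car_sales["Odometer (KM)"] > 100000

# & combines the masks element-wise; use | for "or".
high_mileage_toyotas = car_sales[is_toyota & high_mileage]
print(high_mileage_toyotas)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because & binds more tightly than == and >.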
