# Pandas
---
- Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools.
- The name "*Pandas*" is derived from "*Panel Data*" (an econometrics term for multidimensional data).
- Pandas works primarily with dataframes.



## Dataframes
---
- A dataframe is a two-dimensional, size-mutable table.
- It is a potentially heterogeneous tabular data structure (its columns can hold different datatypes) with labelled axes (rows and columns).
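
As a minimal sketch of these properties, the snippet below builds a small dataframe by hand. The values and the row labels (`car_a`, `car_b`, `car_c`) are made up for illustration and are not taken from the Toyota dataset.

```python
import pandas as pd

# Hypothetical toy data: each column holds a different datatype.
cars = pd.DataFrame(
    {
        "Price": [13500, 13750, 13950],              # integers
        "Age": [23.0, 23.0, 24.0],                   # floats
        "FuelType": ["Diesel", "Diesel", "Petrol"],  # strings
    },
    index=["car_a", "car_b", "car_c"],  # row labels
)

# The size is mutable: adding a column changes the shape of the same object.
cars["Doors"] = [3, 3, 5]

print(cars.shape)   # (3, 4)
print(cars.dtypes)  # one datatype per column
```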

## Importing Data
---
Before we can perform any operations on a dataframe, we first need to import the data as follows :

In [None]:
import pandas as pd
Toyota = "/content/drive/My Drive/Colab Notebooks/Datasets/Toyota.csv"
cars_data = pd.read_csv(Toyota, index_col = 0)
cars_data

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986,Diesel,90,1.0,0,2000,three,1165
1,13750,23.0,72937,Diesel,90,1.0,0,2000,3,1165
2,13950,24.0,41711,Diesel,90,,0,2000,3,1165
3,14950,26.0,48000,Diesel,90,0.0,0,2000,3,1165
4,13750,30.0,38500,Diesel,90,0.0,0,2000,3,1170
...,...,...,...,...,...,...,...,...,...,...
1431,7500,,20544,Petrol,86,1.0,0,1300,3,1025
1432,10845,72.0,??,Petrol,86,0.0,0,1300,3,1015
1433,8500,,17016,Petrol,86,0.0,0,1300,3,1015
1434,7250,70.0,??,,86,1.0,0,1300,3,1015


Passing the `index_col=0` argument makes the first column of our dataset the index column, instead of letting pandas add its own default index column while reading the data.
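
The effect of `index_col=0` can be seen on a tiny in-memory CSV. This is only a sketch: the `csv_text` string below is a made-up stand-in for the real file, with an unnamed first column like the one in the dataset above.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column has no header name.
csv_text = ",Price,Age\n0,13500,23.0\n1,13750,23.0\n"

# Without index_col, the first column is kept as data and named "Unnamed: 0".
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index.
first_col_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_index.columns))    # ['Unnamed: 0', 'Price', 'Age']
print(list(first_col_index.columns))  # ['Price', 'Age']
```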

In the above dataset, the column names describe the following attributes :

1. Price : Price of the pre-owned car
2. Age : Age of the car (in months)
3. KM : Distance the car has travelled (in kilometers)
4. FuelType : Type of fuel used in the car
5. HP : Horsepower of the car
6. MetColor : Whether the car has a metallic color or not
7. Automatic : Whether the gearbox of the car is automatic or manual
8. CC : Size of the car engine in cubic centimeters
9. Doors : Number of doors in the car
10. Weight : The weight of the car



## Creating Copy Of Original Data
---
In pandas, there are two ways to create a copy of a dataset :
1. Shallow Copy
2. Deep Copy

### 1. Shallow Copy
---
Its syntax is :
```python
shallow_copy = dataset.copy(deep=False)
```
- It creates a new object that shares the data and index of the original object instead of copying them.
- In pandas versions without Copy-on-Write, any changes made to the data of the copy will be reflected in the original object as well (and vice versa).


### 2. Deep Copy
---
Its syntax is :
```python
deep_copy = dataset.copy(deep=True)
```
- Even if we don't pass the `deep=True` argument, we will still get a deep copy because `deep=True` is the default argument of `.copy()`.
- In a deep copy, the data and index of the object are copied into a new object with no reference to the original.
- Any changes made to the copy of the object will not be reflected in the original object.
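
The difference between the two kinds of copy can be checked on a small, made-up dataframe. This is a sketch under one stated assumption: whether a write to the shallow copy shows up in the original depends on the pandas version (with Copy-on-Write enabled it does not), so only the deep-copy behaviour is annotated as certain.

```python
import pandas as pd

# Hypothetical toy data.
original = pd.DataFrame({"Price": [13500, 13750], "Age": [23.0, 24.0]})

shallow = original.copy(deep=False)  # shares data with `original`
deep = original.copy(deep=True)      # owns an independent copy of the data

# Writing to the deep copy never touches the original.
deep.loc[0, "Price"] = 0

# Writing to the shallow copy may or may not be visible in the original,
# depending on whether Copy-on-Write is active in the installed pandas.
shallow.loc[1, "Price"] = -1

print(original)  # row 0 still has Price 13500; row 1 depends on the version
print(deep)
```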

## Indexing & Selecting Data
---
- Python slicing operator (`[]`) and dot/attribute operator (`.`) are used for indexing.
- Indexing provides quick and easy access to pandas data structure.

- `dataframe.index` : To get the row labels (index) of the dataframe.

In [None]:
cars_data.index

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435],
           dtype='int64', length=1436)

- `dataframe.columns` : To get the column labels of the dataframe.

In [None]:
cars_data.columns

Index(['Price', 'Age', 'KM', 'FuelType', 'HP', 'MetColor', 'Automatic', 'CC',
       'Doors', 'Weight'],
      dtype='object')

- `dataframe.size` : To get the total number of elements (rows x columns) present in the dataframe.

In [None]:
cars_data.size

14360

- `dataframe.shape` : To get the dimensionality (rows, columns) of the dataframe.

In [None]:
cars_data.shape

(1436, 10)

- `dataframe.memory_usage(index=True, deep=False)` : To get the memory usage of each column in bytes. With `deep=True`, the contents of object columns are measured as well.

In [None]:
cars_data.memory_usage()

Index        11488
Price        11488
Age          11488
KM           11488
FuelType     11488
HP           11488
MetColor     11488
Automatic    11488
CC           11488
Doors        11488
Weight       11488
dtype: int64

- `dataframe.ndim` : To get the number of axes/array dimensions (for dataframes it's always 2).

In [None]:
cars_data.ndim

2
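
The attributes above are related to each other, which a short sketch on made-up data can confirm: `size` is the product of the two entries of `shape`, and `ndim` is the length of `shape`.

```python
import pandas as pd

# Hypothetical toy data.
frame = pd.DataFrame(
    {"Price": [13500, 13750, 13950], "FuelType": ["Diesel", "Diesel", "Petrol"]}
)

n_rows, n_cols = frame.shape

print(frame.size == n_rows * n_cols)   # True
print(frame.ndim == len(frame.shape))  # True

# One entry for the index plus one entry per column, in bytes.
# deep=True also counts the strings stored in object columns.
print(frame.memory_usage(deep=True))
```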

- `dataframe.head([n])` : Returns the first `n` rows of the dataframe.
- By default, `head()` returns the first 5 rows.

In [None]:
cars_data.head(8)

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986,Diesel,90,1.0,0,2000,three,1165
1,13750,23.0,72937,Diesel,90,1.0,0,2000,3,1165
2,13950,24.0,41711,Diesel,90,,0,2000,3,1165
3,14950,26.0,48000,Diesel,90,0.0,0,2000,3,1165
4,13750,30.0,38500,Diesel,90,0.0,0,2000,3,1170
5,12950,32.0,61000,Diesel,90,0.0,0,2000,3,1170
6,16900,27.0,??,Diesel,????,,0,2000,3,1245
7,18600,30.0,75889,,90,1.0,0,2000,3,1245


- `dataframe.tail([n])` : Returns the last `n` rows of the dataframe.
- By default, `tail()` returns the last 5 rows.

In [None]:
cars_data.tail()

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
1431,7500,,20544,Petrol,86,1.0,0,1300,3,1025
1432,10845,72.0,??,Petrol,86,0.0,0,1300,3,1015
1433,8500,,17016,Petrol,86,0.0,0,1300,3,1015
1434,7250,70.0,??,,86,1.0,0,1300,3,1015
1435,6950,76.0,1,Petrol,110,0.0,0,1600,5,1114


- Both `head()` and `tail()` are useful for quickly verifying data, especially after sorting or appending data.

- To access a single scalar value, the fastest way is to use the `at` and `iat` accessors.
  - `at` provides label-based scalar lookups.<br>
  We need to provide the row label and the column name to get the element.
  - `iat` provides integer-position-based lookups.<br>
  We need to provide the integer positions of the row and the column to get the element.

In [None]:
cars_data.at[4,"FuelType"]

'Diesel'

In [None]:
cars_data.iat[5,6]

0

- To access a group of rows and columns by label(s), `.loc[]` can be used. 

In [None]:
cars_data.loc[:,"FuelType"]

0       Diesel
1       Diesel
2       Diesel
3       Diesel
4       Diesel
         ...  
1431    Petrol
1432    Petrol
1433    Petrol
1434       NaN
1435    Petrol
Name: FuelType, Length: 1436, dtype: object
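
The difference between label-based and position-based access is easiest to see when the row labels are not `0, 1, 2, ...`. The sketch below uses a made-up dataframe with the labels 10, 20 and 30.

```python
import pandas as pd

# Hypothetical toy data with non-default row labels.
cars = pd.DataFrame(
    {"Price": [13500, 13750, 13950], "FuelType": ["Diesel", "Petrol", "Diesel"]},
    index=[10, 20, 30],
)

print(cars.at[20, "FuelType"])  # Petrol (row label 20, column label "FuelType")
print(cars.iat[1, 1])           # Petrol (second row, second column by position)

# .loc slices by label and includes both end points of the slice.
print(cars.loc[10:20, "Price"])
```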

## Datatypes
---
- The way information is stored in a dataframe or a Python object affects the analysis and the outputs of calculations.
- There are two main kinds of datatypes :
  - Numeric Type
  - Character Type

### Numeric Type
---
Numeric types include integers and floats.<br>
***Example :***<br>
*Integer => 10<br>
Float => 10.53*

### Character Type
---
Strings are stored with the object datatype in Pandas, which can hold values that contain numbers and/or characters.<br>
***Example :***<br>
*String => "Category01"*

#### Difference Between *Category* and *Object*
---
Category | Object |
---|---|
A string variable consisting of only a few different values.<br> Converting such a string variable to a categorical variable will save some memory.|A column is assigned the object datatype when it holds strings or mixed types (numbers & strings).<br> A string column that also contains `NaN` values is still stored as object.|
A categorical variable takes on a limited, fixed number of possible values.|For strings, the length is not fixed.|

- `nbytes` : An attribute (not a method) that gives the total bytes consumed by the elements of a column.

***Syntax***
```python
ndarray.nbytes
```
Let's see the impact on the space consumption, when we have a column as object datatype and as category datatype.

Let's take the `FuelType` column for testing

In [None]:
cars_data["FuelType"].nbytes #With Object Datatype

11488

In [None]:
cars_data["FuelType"].astype("category").nbytes #With Category Datatypes

1460

So, in conclusion, when we deal with a large amount of data, it is worth converting string columns that hold only a few distinct values (categorical variables) to the "category" datatype.
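
The same comparison can be reproduced without the dataset. The sketch below uses a made-up series of 3,000 fuel labels with only three distinct values; `dtype=object` is set explicitly so that the starting point matches the object column above. Note that `nbytes` on an object column counts only the 8-byte references, so `memory_usage(deep=True)` is printed as well to include the strings themselves.

```python
import pandas as pd

# Hypothetical data: many rows, few distinct values.
fuel = pd.Series(["Petrol", "Diesel", "CNG"] * 1000, dtype=object)
as_category = fuel.astype("category")

print(fuel.nbytes)         # bytes for the object references only
print(as_category.nbytes)  # bytes for the small integer codes plus the categories

# deep=True also measures the Python strings behind the object column.
print(fuel.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```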

- Base Python and pandas use different names for the same datatypes.


Python Data Type | Pandas Data Type | Description |
---|---|---|
int|int64|Whole numbers|
float|float64|Numbers with decimals|

- *64* refers to the number of bits allocated to store the value in each *cell*, which determines the range (for integers) or the precision (for floats) of the numbers it can hold.
- *64 bits* is equivalent to *8 bytes*.
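
The link between bits, bytes and value range can be checked directly with NumPy. This is a small sketch on made-up values; `prices` is a hypothetical series, not a column of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data stored as 64-bit integers.
prices = pd.Series([13500, 13750], dtype="int64")

print(prices.dtype.itemsize)   # 8 (bytes per value, i.e. 64 bits)
print(prices.nbytes)           # 16 (2 values x 8 bytes)
print(np.iinfo(np.int64).max)  # 9223372036854775807, the largest int64 value
```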

### Checking Datatype Of Each Column
---
- `dataframe.dtypes` : Returns a series with datatype of all the columns.


In [None]:
cars_data.dtypes

Price          int64
Age          float64
KM            object
FuelType      object
HP            object
MetColor     float64
Automatic      int64
CC             int64
Doors         object
Weight         int64
dtype: object

- `dataframe.dtypes.value_counts()`: Returns counts of unique datatypes in the dataframe.

In [None]:
cars_data.dtypes.value_counts()

object     4
int64      4
float64    2
dtype: int64

### Selecting Data Based On Datatypes
---

- `dataframe.select_dtypes(include = None, exclude = None)`: Returns a subset of the columns from dataframe based on the column datatype.

In [None]:
cars_data.select_dtypes(exclude=[object])

Unnamed: 0,Price,Age,MetColor,Automatic,CC,Weight
0,13500,23.0,1.0,0,2000,1165
1,13750,23.0,1.0,0,2000,1165
2,13950,24.0,,0,2000,1165
3,14950,26.0,0.0,0,2000,1165
4,13750,30.0,0.0,0,2000,1170
...,...,...,...,...,...,...
1431,7500,,1.0,0,1300,1025
1432,10845,72.0,0.0,0,1300,1015
1433,8500,,0.0,0,1300,1015
1434,7250,70.0,1.0,0,1300,1015
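
As a sketch of both parameters, the snippet below splits a made-up dataframe into its numeric and non-numeric columns. It uses the `"number"` selector, which matches all integer and float columns.

```python
import pandas as pd

# Hypothetical toy data with one missing Age value.
cars = pd.DataFrame(
    {
        "Price": [13500, 13750],
        "Age": [23.0, None],
        "FuelType": ["Diesel", "Petrol"],
    }
)

numeric = cars.select_dtypes(include="number")      # integer and float columns
non_numeric = cars.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))      # ['Price', 'Age']
print(list(non_numeric.columns))  # ['FuelType']
```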


### Concise Summary Of Dataframe
---
- `info()` : Prints a concise summary of a dataframe. This includes :
  - Datatype of index
  - Datatype of columns
  - Count of non-null values
  - Memory Usage

- ***Syntax :***
```python
dataframe.info()
```

In [None]:
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
Price        1436 non-null int64
Age          1336 non-null float64
KM           1436 non-null object
FuelType     1336 non-null object
HP           1436 non-null object
MetColor     1286 non-null float64
Automatic    1436 non-null int64
CC           1436 non-null int64
Doors        1436 non-null object
Weight       1436 non-null int64
dtypes: float64(2), int64(4), object(4)
memory usage: 123.4+ KB


### Analysing The Format Of Each Column
---
By using `info()`, we can see :
- `KM` has been read as "object" instead of an integer type.
- `HP` has been read as "object" instead of an integer type.
- `MetColor` has been read as "float64" although it only holds the values 0/1.
- `Automatic` has been read as "int64" although it only holds the values 0/1.
- `Doors` has been read as "object" instead of an integer type.

So, let's find the reasons behind these datatypes and resolve them.


In [6]:
import numpy as np
print(np.unique(cars_data["KM"]))

[  1.  15. 225. ...  nan  nan  nan]


In [None]:
print(np.unique(cars_data["HP"]))

['107' '110' '116' '192' '69' '71' '72' '73' '86' '90' '97' '98' '????']


In [None]:
print(np.unique(cars_data["MetColor"]))

[ 0.  1. nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan]


In [None]:
print(np.unique(cars_data["Automatic"]))

[0 1]


In [None]:
print(np.unique(cars_data["Doors"]))

['2' '3' '4' '5' 'five' 'four' 'three']


- `KM` contains the special string `??` and is hence read as "object" instead of "int64".
- `HP` contains the special string `????` and is hence read as "object" instead of "int64".
- `MetColor` contains missing values (`NaN`), so its `0` and `1` values are stored as `0.` and `1.` and the column is read as `float64`.
- `Automatic` has only `1` and `0` with no missing values and is thus read as `int64`.
- `Doors` has some numbers written as words and is hence read as "object".
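
These causes can be reproduced on a three-row CSV. This is only a sketch: `csv_text` is made up, and it contains one `??` in `KM` and one empty cell in `MetColor`.

```python
import io

import pandas as pd

# Hypothetical CSV text imitating the problems found above.
csv_text = (
    "KM,MetColor,Automatic\n"
    "46986,1,0\n"
    "??,,1\n"
    "41711,0,0\n"
)

sample = pd.read_csv(io.StringIO(csv_text))

# KM is not numeric because of "??", MetColor is float64 because of the
# empty cell (NaN), and Automatic stays int64.
print(sample.dtypes)
```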

Let's import the data again, this time treating the special strings `??` and `????` as `NaN` values, and then create a deep copy of the original data for further analysis.

In [None]:
Toyota = "/content/drive/My Drive/Colab Notebooks/Datasets/Toyota.csv"
cars_data = pd.read_csv(Toyota, index_col = 0,na_values=["??","????"])
Cars = cars_data.copy(deep=True)
Cars.head()

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986.0,Diesel,90.0,1.0,0,2000,three,1165
1,13750,23.0,72937.0,Diesel,90.0,1.0,0,2000,3,1165
2,13950,24.0,41711.0,Diesel,90.0,,0,2000,3,1165
3,14950,26.0,48000.0,Diesel,90.0,0.0,0,2000,3,1165
4,13750,30.0,38500.0,Diesel,90.0,0.0,0,2000,3,1170


Now let's check the datatypes of the `KM` and `HP` columns. If we have done everything correctly, both will be "float64" (the `NaN` values prevent them from being integers).

In [None]:
Cars["KM"].dtypes

dtype('float64')

In [None]:
Cars["HP"].dtypes

dtype('float64')
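
If the data has already been loaded, a different technique from the `na_values` approach used above is `pd.to_numeric` with `errors="coerce"`, which turns every value that cannot be parsed as a number into `NaN`. The sketch below applies it to a made-up series.

```python
import pandas as pd

# Hypothetical column that was read as text because of "??".
km = pd.Series(["46986", "??", "41711"], dtype=object)

# Unparseable entries become NaN instead of raising an error.
km_numeric = pd.to_numeric(km, errors="coerce")

print(km_numeric.dtype)           # float64
print(km_numeric.isnull().sum())  # 1
```

Unlike `na_values`, this coerces any unparseable string, not only the ones listed explicitly, so it can hide unexpected values.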

### Converting Datatype Of Variables
---
- `MetColor` and `Automatic` are categorical data, but because their values are `1` and `0`, they have been read as float and integer types respectively.
- `astype()` : Used to explicitly convert a column from one datatype to another.

***Syntax :***
```python
dataframe[columnname].astype(dtype)
```


In [None]:
Cars["MetColor"] = Cars["MetColor"].astype("object")
Cars["Automatic"] = Cars["Automatic"].astype("object")

Cars.dtypes

Price          int64
Age          float64
KM           float64
FuelType      object
HP           float64
MetColor      object
Automatic     object
CC             int64
Doors         object
Weight         int64
dtype: object
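
The cell above converts the two flags to `object`. As an alternative sketch (not what this notebook does), `astype("category")` can be used for such 0/1 flags, in line with the memory comparison made earlier. The `flags` dataframe below is made up.

```python
import pandas as pd

# Hypothetical 0/1 flag columns, one of them with a missing value.
flags = pd.DataFrame({"MetColor": [1.0, 0.0, None], "Automatic": [0, 1, 0]})

flags["MetColor"] = flags["MetColor"].astype("category")
flags["Automatic"] = flags["Automatic"].astype("category")

print(flags.dtypes)
# Missing values are kept as NaN and are not turned into a category.
print(flags["MetColor"].cat.categories)
```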

### Replacing Values In A Dataframe
---
We have seen that the `Doors` variable has some numbers written as words instead of digits. To replace these values, we can use the `replace()` function.

***Syntax :***
```python
dataframe.replace([to_replace, value,...])
```

In [None]:
# Map the spelled-out numbers to digits in one call and assign the result back,
# instead of calling replace(..., inplace=True) on a selected column.
Cars["Doors"] = Cars["Doors"].replace({"three": 3, "four": 4, "five": 5})

Now, let's convert the `Doors` column to integers and then look at the unique values present in it :

In [11]:
Cars["Doors"] = Cars["Doors"].astype('int64')
Cars["Doors"].dtype

dtype('int64')

In [12]:
print(np.unique(Cars["Doors"]))

[2 3 4 5]


### Detecting Missing Values In A Dataframe
---
To check the count of missing values present in each column, we use `dataframe.isnull().sum()`.

Let's check this in our dataframe :

In [13]:
Cars.isnull().sum()

Price          0
Age          100
KM            15
FuelType     100
HP             6
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64
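
The same count can be traced by hand on a small, made-up dataframe. The sketch also shows one possible next step, selecting the rows that contain at least one missing value with `isnull().any(axis=1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three missing cells.
cars = pd.DataFrame(
    {
        "Price": [13500, 13750, 13950],
        "Age": [23.0, np.nan, 24.0],
        "FuelType": ["Diesel", None, None],
    }
)

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = cars.isnull().sum()
print(missing_per_column)  # Price 0, Age 1, FuelType 2

# Rows in which at least one column is missing.
rows_with_missing = cars[cars.isnull().any(axis=1)]
print(len(rows_with_missing))  # 2
```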