## Introduction to Pandas
**Pandas** is a Python package that provides fast, flexible, and expressive data structures for working with **relational** and **labelled** data. Most analysis in Data Science and Machine Learning uses **tabular** data, typically loaded from **CSV** files or **relational databases**. Some of the advantages of Pandas are listed below:
- Fast and efficient manipulation and analysis of data.
- Built-in handling of missing data, as well as merging and joining of datasets.
- Size mutability: columns can be inserted into and deleted from a DataFrame.
- Powerful groupby and time-series functionality with statistical computations.

### Data Structures in Pandas
Pandas provides two main data structures for handling data: **Series** and **DataFrame**.

- Series:
    A Pandas Series is a one-dimensional labelled array that can hold data of any type, including integers, strings, floats, and Python objects. A Series is comparable to a single column in an Excel sheet.
- DataFrame:
    Pandas DataFrame is a two-dimensional tabular data structure with rows and columns.
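
The difference between the two structures can be sketched as follows. This is only an illustration: the names and marks are reused from the dictionary example later in this tutorial, and the `Passed` column is a made-up assumption.

```python
import pandas as pd

# A Series: one-dimensional, labelled values (like a single spreadsheet column).
marks = pd.Series([90, 86.5, 87.3], index=["Ram", "Hari", "Gita"], name="Marks")

# A DataFrame: two-dimensional; each column is itself a Series.
# The "Passed" column is hypothetical and only here for illustration.
students = pd.DataFrame(
    {"Marks": [90, 86.5, 87.3], "Passed": [True, True, True]},
    index=["Ram", "Hari", "Gita"],
)

print(marks.ndim)               # number of dimensions of the Series
print(students.ndim)            # number of dimensions of the DataFrame
print(type(students["Marks"]))  # selecting one column gives back a Series
```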
    

### Installing Pandas 
Installation is simple: use **pip** with `pip install pandas`. To install a specific version, use `pip install pandas==x.x.x`, where x.x.x is the version you want. For example, `pip install pandas==1.4.2` installs version **1.4.2** of Pandas.

## Basic Usage

### Importing Pandas
To use pandas in a Python project, run `import pandas`. After the import, you can check the installed version with `pandas.__version__`.

In [3]:
import pandas


In [2]:
pandas.__version__


Alternatively, the package can be imported under an alias for easier usage, for example `import pandas as pd`. After that, instead of writing `pandas.method_name` each time we call a pandas method, we can write `pd.method_name`.

In [1]:
import pandas as pd


In [5]:
pd.__version__

'2.1.4'

### Creating Pandas Series

#### Using Numpy Arrays

##### Creating Series without index from array

In [6]:
import numpy as np

In [7]:
array_a = np.array(['I', 'Love', 'Python'])

In [8]:
series_from_array = pd.Series(array_a)
series_from_array

0         I
1      Love
2    Python
dtype: object

In [9]:
type(series_from_array)

pandas.core.series.Series

In [10]:
series_from_array[2]

'Python'

##### Creating Series with Index from array

In [11]:
series_from_array_index = pd.Series(array_a, index=['a','b','c'])
series_from_array_index

a         I
b      Love
c    Python
dtype: object

In [13]:
series_from_array_index['c']

'Python'

*Note: Series can be created from Python List similar to the above by passing the list as an argument to `pd.Series()`.*
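
As a minimal sketch of the note above, the same Series can be built from a plain Python list. The variable names here are illustrative and not part of the original notebook.

```python
import pandas as pd

words = ["I", "Love", "Python"]

# Without an explicit index, pandas assigns the default labels 0, 1, 2.
series_from_list = pd.Series(words)

# With an explicit index, the labels are used instead.
series_from_list_index = pd.Series(words, index=["a", "b", "c"])

print(series_from_list.iloc[2])     # third element by position
print(series_from_list_index["c"])  # same element by label
```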

#### Using Python Dictionary

In [10]:
dict_a = {'Ram':90, 'Hari':86.5, 'Gita':87.3}

In [11]:
series_from_dict = pd.Series(dict_a)
series_from_dict

Ram     90.0
Hari    86.5
Gita    87.3
dtype: float64

In [12]:
series_from_dict.index

Index(['Ram', 'Hari', 'Gita'], dtype='object')

In [13]:
series_from_dict.values

array([90. , 86.5, 87.3])

### Creating Pandas DataFrames

#### Using Python List

In [15]:
list_a = ['Python','Ruby','Rust','Java','PHP']

In [16]:
dataframe_from_list = pd.DataFrame(list_a)
dataframe_from_list

|   | 0 |
|---|---|
| 0 | Python |
| 1 | Ruby |
| 2 | Rust |
| 3 | Java |
| 4 | PHP |


In [17]:
#to create a column name, use the following
dataframe_from_list_col_name = pd.DataFrame(list_a, columns=['Programming Languages'])
dataframe_from_list_col_name

|   | Programming Languages |
|---|---|
| 0 | Python |
| 1 | Ruby |
| 2 | Rust |
| 3 | Java |
| 4 | PHP |


In [20]:
dataframe_from_list_col_name['Programming Languages'][2]

'Rust'

In [24]:
dataframe_from_list_col_name['Programming Languages'].iloc[2]

'Rust'

##### Creating DataFrame from a List of Lists

In [17]:
list_b = [['Ram Thapa','Koteshwor'],['Nitesh Rai','London']]

In [18]:
data_frame = pd.DataFrame(list_b, columns=['Name','Address'])
data_frame

|   | Name | Address |
|---|---|---|
| 0 | Ram Thapa | Koteshwor |
| 1 | Nitesh Rai | London |


#### Using Python Dictionary

In [19]:
dict_b = {'Name':['Ram Thapa', 'Nitesh Rai'], 'Address':['Koteshwor','London']}

In [20]:
dataframe_from_dict = pd.DataFrame(dict_b)
dataframe_from_dict

|   | Name | Address |
|---|---|---|
| 0 | Ram Thapa | Koteshwor |
| 1 | Nitesh Rai | London |


In [21]:
#create a row label 
dataframe_from_dict_labels = pd.DataFrame(dict_b, index=['A','B'])
dataframe_from_dict_labels

|   | Name | Address |
|---|---|---|
| A | Ram Thapa | Koteshwor |
| B | Nitesh Rai | London |


### Indexing and Slicing Pandas Series

#### Indexing by Item name

In [25]:
series_from_array_index

a         I
b      Love
c    Python
dtype: object

In [26]:
series_from_array_index['a']

'I'

In [27]:
series_from_array_index['a':'c']

a         I
b      Love
c    Python
dtype: object

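#### Indexing by number (Positional value)

In [25]:
# .iloc makes positional access explicit; plain [0] on a non-integer index is deprecated in recent pandas versions
series_from_array_index.iloc[0]

'I'

In [26]:
series_from_array_index.iloc[0:2]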

a       I
b    Love
dtype: object
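
The two indexing styles above can be contrasted with the explicit `.loc` (label-based) and `.iloc` (position-based) accessors. The following is a minimal sketch; the Series `s` is re-created here only so that the block runs on its own.

```python
import pandas as pd

s = pd.Series(["I", "Love", "Python"], index=["a", "b", "c"])

# Label-based slice: both endpoints are included.
by_label = s.loc["a":"b"]

# Position-based slice: the end position is excluded, as in Python lists.
by_position = s.iloc[0:2]

print(by_label)
print(by_position)
```

Both slices select the elements labelled `a` and `b`, but for different reasons: the label slice includes its end label, while the positional slice stops before position 2.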

### Reading a csv file in Pandas

In Machine Learning, datasets are most commonly distributed in CSV (Comma-Separated Values) format.
![CSV data format](pandas_data_frame.png)

For experimentation, we will use the Titanic training dataset, read below from the local file `data/titanic_train.csv`. The dataset is available from Kaggle's Titanic competition at https://www.kaggle.com/c/titanic

In [7]:
dataset = pd.read_csv("data/titanic_train.csv")


In [4]:
dataset.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [29]:
dataset.index

RangeIndex(start=0, stop=891, step=1)

In [8]:
dataset.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [1]:
dataset.describe()


In [10]:
type(dataset['Age'])

pandas.core.series.Series

In [11]:
dataset['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

In the result above, one of the entries in the **Age** column is **nan** (not a number), which marks a missing value. Because Machine Learning relies on numerical calculations, missing values must be handled with an appropriate mechanism. Some common ways to deal with **nan** values in a dataset are:
- Drop rows or columns with missing values.
- Fill the missing values with a constant or scalar value.
- Fill the missing values with an aggregated value such as the mean, median, or mode of the particular column.
- Fill the missing value with the previous or next value (forward or backward fill).

In [30]:
dataset.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [13]:
dataset.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False
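
The boolean tables above are hard to read at a glance, so a common follow-up is to count the missing values per column. The sketch below uses a small made-up frame (`toy`, with hypothetical values) rather than the Titanic data so that it runs on its own; it also checks that `isnull()` is an alias of `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# True counts as 1, so summing the boolean mask counts missing values per column.
missing_per_column = toy.isna().sum()

# The mean of the mask gives the fraction of missing values per column.
missing_share = toy.isna().mean()

# isnull() is an alias of isna(): both return the same mask.
same = toy.isna().equals(toy.isnull())

print(missing_per_column)
print(missing_share)
print(same)
```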


In [14]:
dataset['Age'].value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: Age, Length: 88, dtype: int64

In [15]:
dataset.shape

(891, 12)

In [16]:
dataset.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

#### Filtering Rows
You can select specific rows of your data. The most basic way is to give a criterion based on a column value. For example, to keep only the passengers who survived the Titanic disaster, filter on the `Survived` column.

In [17]:
dataset['Survived']

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [19]:
survived_passengers = dataset[dataset['Survived'] == 1]

In [20]:
survived_passengers

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
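
Several criteria can be combined in one filter. The sketch below assumes a small made-up frame (`toy`, with placeholder names and values) so that it runs without the CSV file; conditions are combined with `&` (and) and `|` (or), each wrapped in parentheses, or written as a string for `DataFrame.query`.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],  # placeholder names
    "Survived": [0, 1, 1, 1],
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, 4.0, 35.0],
})

# Rows where both conditions hold.
survived_women = toy[(toy["Survived"] == 1) & (toy["Sex"] == "female")]

# Rows where at least one condition holds.
children_or_seniors = toy[(toy["Age"] < 12) | (toy["Age"] > 60)]

# The same kind of filter expressed as a query string.
survived_over_30 = toy.query("Survived == 1 and Age > 30")

print(survived_women)
print(children_or_seniors)
print(survived_over_30)
```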


#### Selecting Columns
You can also select only the columns you are interested in. For example, to see only the name and age of the surviving passengers from the filtered data above, pass a list of column names:

In [21]:
data_with_selected_columns = survived_passengers[['Name','Age']]

In [23]:
data_with_selected_columns

|   | Name | Age |
|---|---|---|
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 |
| 2 | Heikkinen, Miss. Laina | 26.0 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 |
| 8 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 27.0 |
| 9 | Nasser, Mrs. Nicholas (Adele Achem) | 14.0 |
| ... | ... | ... |
| 875 | Najib, Miss. Adele Kiamie "Jane" | 15.0 |
| 879 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | 56.0 |
| 880 | Shelley, Mrs. William (Imanita Parrish Hall) | 25.0 |
| 887 | Graham, Miss. Margaret Edith | 19.0 |


#### Sorting the DataFrame
To sort the dataset above by age, use `sort_values`, which sorts a DataFrame by one or more columns.

In [27]:
sorted_data = data_with_selected_columns.sort_values(by=["Age"])

In [25]:
sorted_data

|   | Name | Age |
|---|---|---|
| 803 | Thomas, Master. Assad Alexander | 0.42 |
| 755 | Hamalainen, Master. Viljo | 0.67 |
| 644 | Baclini, Miss. Eugenie | 0.75 |
| 469 | Baclini, Miss. Helene Barbara | 0.75 |
| 831 | Richards, Master. George Sibley | 0.83 |
| ... | ... | ... |
| 727 | Mannion, Miss. Margareth | NaN |
| 740 | Hawksford, Mr. Walter James | NaN |
| 828 | McCormack, Mr. Thomas Joseph | NaN |
| 839 | Marechal, Mr. Pierre | NaN |


Sorting can be done in ascending or descending order, and missing values (**NaN**) can be placed either first or last.

In [30]:
sorted_descending_nan_first = data_with_selected_columns.sort_values(by=["Age"], ascending=False, na_position='first')

In [31]:
sorted_descending_nan_first

|   | Name | Age |
|---|---|---|
| 17 | Williams, Mr. Charles Eugene | NaN |
| 19 | Masselmani, Mrs. Fatima | NaN |
| 28 | O'Dwyer, Miss. Ellen "Nellie" | NaN |
| 31 | Spencer, Mrs. William Augustus (Marie Eugenie) | NaN |
| 32 | Glynn, Miss. Mary Agatha | NaN |
| ... | ... | ... |
| 831 | Richards, Master. George Sibley | 0.83 |
| 644 | Baclini, Miss. Eugenie | 0.75 |
| 469 | Baclini, Miss. Helene Barbara | 0.75 |
| 755 | Hamalainen, Master. Viljo | 0.67 |
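
`sort_values` also accepts several columns, each with its own sort direction. The following sketch uses a made-up frame (`toy`, with placeholder names and values) so that it runs on its own; `reset_index(drop=True)` is an optional extra step that renumbers the rows after sorting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],  # placeholder names
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# Sort by class (ascending), then by age (descending) within each class.
# Missing ages are placed last by default (na_position='last').
by_class_then_age = toy.sort_values(by=["Pclass", "Age"], ascending=[True, False])

# Sorting keeps the original row labels; reset_index renumbers them from 0.
renumbered = by_class_then_age.reset_index(drop=True)

print(by_class_then_age)
print(renumbered)
```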


### Handling missing data 

#### Drop the rows and columns with missing data

In [33]:
dataset.shape

(891, 12)

In [34]:
dataset_dropped_nan = dataset.dropna()

In [36]:
dataset_dropped_nan.shape

(183, 12)

In [37]:
dataset_dropped_nan.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,False,False
10,False,False,False,False,False,False,False,False,False,False,False,False
11,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
871,False,False,False,False,False,False,False,False,False,False,False,False
872,False,False,False,False,False,False,False,False,False,False,False,False
879,False,False,False,False,False,False,False,False,False,False,False,False
887,False,False,False,False,False,False,False,False,False,False,False,False
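
Calling `dropna()` with no arguments removes every row that has at least one missing value, which is why only 183 of the 891 rows remain above. As a sketch of the alternatives, the block below uses a made-up frame (`toy`, hypothetical values) to show dropping columns with `axis=1` and restricting the check to chosen columns with `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 7.93],
})

# Default: drop every row that contains any missing value.
rows_dropped = toy.dropna()

# axis=1: drop every column that contains any missing value.
columns_dropped = toy.dropna(axis=1)

# subset: only look at the listed columns when deciding which rows to drop.
age_known = toy.dropna(subset=["Age"])

print(rows_dropped.shape)
print(columns_dropped.shape)
print(age_known.shape)
```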


#### Fill the missing value with constant or scalar value 

In [40]:
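# Note: '0' is a string, so numeric columns such as Age end up with dtype object (see the output below); pass 0 to keep them numeric.
dataset_with_scalar_filled = dataset.fillna('0')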

In [41]:
dataset_with_scalar_filled['Age'].unique()

array([22.0, 38.0, 26.0, 35.0, '0', 54.0, 2.0, 27.0, 14.0, 4.0, 58.0,
       20.0, 39.0, 55.0, 31.0, 34.0, 15.0, 28.0, 8.0, 19.0, 40.0, 66.0,
       42.0, 21.0, 18.0, 3.0, 7.0, 49.0, 29.0, 65.0, 28.5, 5.0, 11.0,
       45.0, 17.0, 32.0, 16.0, 25.0, 0.83, 30.0, 33.0, 23.0, 24.0, 46.0,
       59.0, 71.0, 37.0, 47.0, 14.5, 70.5, 32.5, 12.0, 9.0, 36.5, 51.0,
       55.5, 40.5, 44.0, 1.0, 61.0, 56.0, 50.0, 36.0, 45.5, 20.5, 62.0,
       41.0, 52.0, 63.0, 23.5, 0.92, 43.0, 60.0, 10.0, 64.0, 13.0, 48.0,
       0.75, 53.0, 57.0, 80.0, 70.0, 24.5, 6.0, 0.67, 30.5, 0.42, 34.5,
       74.0], dtype=object)

In [43]:
dataset['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

#### Filling the missing values with aggregated values

In [46]:
dataset_with_mean_filling = dataset.copy()

In [47]:
dataset_with_mean_filling


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [50]:
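# Assign the result back instead of using inplace=True on a selected column, which recent pandas versions discourage
dataset_with_mean_filling['Age'] = dataset_with_mean_filling['Age'].fillna(dataset_with_mean_filling['Age'].mean())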

In [51]:
dataset_with_mean_filling['Age'].unique()

array([22.        , 38.        , 26.        , 35.        , 29.69911765,
       54.        ,  2.        , 27.        , 14.        ,  4.        ,
       58.        , 20.        , 39.        , 55.        , 31.        ,
       34.        , 15.        , 28.        ,  8.        , 19.        ,
       40.        , 66.        , 42.        , 21.        , 18.        ,
        3.        ,  7.        , 49.        , 29.        , 65.        ,
       28.5       ,  5.        , 11.        , 45.        , 17.        ,
       32.        , 16.        , 25.        ,  0.83      , 30.        ,
       33.        , 23.        , 24.        , 46.        , 59.        ,
       71.        , 37.        , 47.        , 14.5       , 70.5       ,
       32.5       , 12.        ,  9.        , 36.5       , 51.        ,
       55.5       , 40.5       , 44.        ,  1.        , 61.        ,
       56.        , 50.        , 36.        , 45.5       , 20.5       ,
       62.        , 41.        , 52.        , 63.        , 23.5 

In [52]:
dataset['Age'].mean()

29.69911764705882
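
The mean is only one possible aggregate. As a sketch of the other two mentioned earlier, the block below fills a numeric column with its median and a text column with its mode (most frequent value). The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 80.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

filled = toy.copy()

# The median is less sensitive to outliers (such as the age 80) than the mean.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# mode() returns a Series because ties are possible; take the first entry.
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode().iloc[0])

print(filled)
```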

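#### Filling the missing values with the previous value (forward fill)

In [53]:
# DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
dataset_with_forward_fill = dataset.ffill()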

In [55]:
dataset_with_forward_fill.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,False,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,False,False,False,False,False,False,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [59]:
dataset_with_forward_fill.shape

(891, 12)

In [61]:
dataset_with_forward_fill.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,C85,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,C123,S
5,6,0,3,"Moran, Mr. James",male,35.0,0,0,330877,8.4583,C123,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,E46,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,E46,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,E46,C


In [62]:
dataset.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
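
Comparing the two tables shows that forward fill cannot fill the `Cabin` value in row 0, because there is no earlier value to copy. Backward fill (`bfill`) has the mirror-image limitation at the end of the data. The sketch below, on a made-up Series, shows both and one possible way to chain them; whether chaining is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with gaps at the start, middle, and end.
cabin = pd.Series([np.nan, "C85", np.nan, "C123", np.nan], name="Cabin")

forward = cabin.ffill()         # copies the previous value; the first entry stays missing
backward = cabin.bfill()        # copies the next value; the last entry stays missing
both = cabin.ffill().bfill()    # forward fill first, then backward fill the leading gap

print(forward)
print(backward)
print(both)
```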


### Handling categorical variables with Pandas
A categorical variable is a variable that takes one of a limited, fixed number of possible values, so that each data unit can be assigned to a category.
For example, in a classification task that detects whether an image contains a cat, the target variable has two possible values: **yes** or **no**. These two values are the categories, and the target is therefore a categorical variable.

In the dataset above, **Sex** is also a categorical variable with two possible values. Such categorical values cannot be fed directly to most Machine Learning algorithms and therefore need to be converted to numerical representations. Two popular encoding methods are:
- Label Encoding
- One Hot Encoding

We will cover these in more detail in a later week. For now, let's see how pandas handles this scenario.

In [64]:
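# dtype=int yields 0/1 columns as shown below; recent pandas versions default to True/False
pd.get_dummies(dataset, columns=["Sex"], dtype=int)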

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,S,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,S,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,S,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,S,1,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,S,1,0
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,C,0,1


Dummy variables take the value **one** or **zero** for each observation. When applied to a column of categories with one category per observation, **pd.get_dummies** produces a new column (variable) for each unique categorical value and places a one in the column corresponding to the category present in that observation. This is equivalent to one-hot encoding.
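
To make the two encodings concrete, here is a minimal sketch on a made-up `Sex` column. Using the pandas `category` dtype for label encoding is one possible approach chosen for this illustration, not necessarily the method used later in the course.

```python
import pandas as pd

# Hypothetical toy data with a single categorical column.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Label encoding: each category is mapped to one integer code.
# With the category dtype, codes follow the sorted category order.
as_category = toy["Sex"].astype("category")
label_encoded = as_category.cat.codes
categories = list(as_category.cat.categories)

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(toy, columns=["Sex"], dtype=int)

print(categories)
print(label_encoded.tolist())
print(one_hot)
```

Label encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that ordering at the cost of one extra column per category.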