# Introduction to Pandas

### References

- Official ```pandas``` documentation: [https://pandas.pydata.org/pandas-docs/stable/](https://pandas.pydata.org/pandas-docs/stable/)
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)

### Importing pandas

In [1]:
import pandas as pd

### Reading csv files

In [6]:
data = pd.read_csv('Gender_Height_Weight_Index.csv')

In [7]:
type(data)

pandas.core.frame.DataFrame

Besides ```csv``` files, pandas can read several other formats through its built-in reader functions.

```read_excel```: for reading Microsoft Excel files

```read_html```: for reading tables from HTML pages

```read_json```: for reading JSON files

and many more...
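
As a minimal sketch of how these readers behave, the following round-trips a small table through CSV and JSON text held in memory. The sample values and the names `sample`, `from_csv`, and `from_json` are made up for illustration and are not part of the dataset used in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample data, made up for illustration
sample = pd.DataFrame({"Gender": ["Male", "Female"], "Height": [174, 185]})

# read_csv accepts a file path or any file-like object
csv_text = sample.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_json works the same way on JSON text
json_text = sample.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
```

Both calls return a ```DataFrame```, so everything shown in the rest of this notebook applies regardless of the source format.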

### Information about the dataframe at a glance

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
Gender    500 non-null object
Height    500 non-null int64
Weight    500 non-null int64
Index     500 non-null int64
dtypes: int64(3), object(1)
memory usage: 15.8+ KB


### Viewing the first few rows of the dataframe

In [10]:
data.head(100)

,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4
3,Female,195,104,3
4,Male,149,61,3
...,...,...,...,...
95,Female,170,156,5
96,Male,142,69,4
97,Male,160,139,5
98,Male,195,69,1


### Shape of the dataframe

In [11]:
data.shape

(500, 4)

Hence, ```data``` contains ```500``` rows and ```4``` columns.

### Data types of each column

In [12]:
data.dtypes

Gender    object
Height     int64
Weight     int64
Index      int64
dtype: object

### Describe the dataframe

In [13]:
data.describe()   # Only numeric columns

,Height,Weight,Index
count,500.0,500.0,500.0
mean,169.944,106.0,3.748
std,16.375261,32.382607,1.355053
min,140.0,50.0,0.0
25%,156.0,80.0,3.0
50%,170.5,106.0,4.0
75%,184.0,136.0,5.0
max,199.0,160.0,5.0


### Pandas series

```Series``` are the building blocks of a pandas ```DataFrame```.

Each column of a ```DataFrame``` is a ```Series```.
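
To make the relationship concrete, here is a small sketch that builds a ```DataFrame``` from two ```Series``` and then pulls one column back out. The values and the names `heights`, `weights`, and `frame` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical values, made up for illustration
heights = pd.Series([174, 189, 185], name="Height")
weights = pd.Series([96, 87, 110], name="Weight")

# Combine the two Series column-wise into a DataFrame
frame = pd.concat([heights, weights], axis=1)

# Selecting a single column gives back a Series
print(type(frame["Height"]))
print(frame)
```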

### Selecting a column

In [14]:
gender = data['Gender']  # Selecting the gender column of the dataframe

gender

0        Male
1        Male
2      Female
3      Female
4        Male
        ...  
495    Female
496    Female
497    Female
498      Male
499      Male
Name: Gender, Length: 500, dtype: object

In [15]:
type(gender)

pandas.core.series.Series

### Making a new column

In [16]:
data['Height + Weight'] = data['Height']+data['Weight']

data.head()

,Gender,Height,Weight,Index,Height + Weight
0,Male,174,96,4,270
1,Male,189,87,2,276
2,Female,185,110,4,295
3,Female,195,104,3,299
4,Male,149,61,3,210
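
The same idea can be tried on a tiny frame. The sketch below also uses ```assign```, which returns a new dataframe instead of modifying the existing one. The rows and the ```Ratio``` column are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical rows, made up for illustration
df = pd.DataFrame({"Height": [174, 189], "Weight": [96, 87]})

# Arithmetic between columns is applied row by row
df["Height + Weight"] = df["Height"] + df["Weight"]

# assign() returns a new DataFrame and leaves df unchanged
with_ratio = df.assign(Ratio=df["Weight"] / df["Height"])

print(df)
print(with_ratio)
```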


### Removing a column

In [12]:
data.drop('Height + Weight', axis = 1)

data.head()

,Gender,Height,Weight,Index,Height + Weight
0,Male,174,96,4,270
1,Male,189,87,2,276
2,Female,185,110,4,295
3,Female,195,104,3,299
4,Male,149,61,3,210


Notice that the original dataframe is unchanged: ```drop``` returns a new dataframe and leaves ```data``` as it was unless ```inplace = True``` is passed.

In [17]:
data.drop('Height + Weight', axis = 1, inplace = True) 

data.head()

,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4
3,Female,195,104,3
4,Male,149,61,3


An alternative is to assign the result of ```drop``` back to ```data```:

``` data = data.drop('Height + Weight', axis = 1) ```
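
The difference between the two forms can be checked on a small frame. This is a sketch on made-up data; the column name ```Extra``` and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical frame, made up for illustration
df = pd.DataFrame({"Height": [174, 189], "Weight": [96, 87], "Extra": [1, 2]})

# Without inplace, drop() returns a new DataFrame; df still has "Extra"
trimmed = df.drop("Extra", axis=1)
print(list(df.columns))
print(list(trimmed.columns))

# With inplace=True, df itself is modified and the call returns None
result = df.drop("Extra", axis=1, inplace=True)
print(result, list(df.columns))
```

Because the in-place call returns ```None```, writing ```data = data.drop(..., inplace = True)``` would replace the dataframe with ```None```; use one form or the other, not both.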

### Selecting rows

Using ```loc``` to select a row by its index label:

In [19]:
data.loc[2]

Gender    Female
Height       185
Weight       110
Index          4
Name: 2, dtype: object

In [15]:
data.head()

,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4
3,Female,195,104,3
4,Male,149,61,3


### Selecting a subset of the dataframe

In [21]:
data.loc[0:5, ['Height', 'Weight']]  # Selects the heights and weights of rows labelled 0 through 5 (6 entries, since loc includes the end label)

,Height,Weight
0,174,96
1,189,87
2,185,110
3,195,104
4,149,61
5,189,104
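
Note that the output has six rows, not five: a ```loc``` slice works on labels and includes its end label. The sketch below contrasts this with ```iloc```, the position-based indexer, which is not used elsewhere in this notebook and is shown only for comparison. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical heights, made up for illustration
df = pd.DataFrame({"Height": [174, 189, 185, 195, 149, 189, 160]})

by_label = df.loc[0:5, ["Height"]]   # labels 0 through 5, end included
by_position = df.iloc[0:5]           # positions 0 through 4, end excluded

print(len(by_label), len(by_position))  # → 6 5
```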


### Conditional selection

In [22]:
data[(data['Gender'] == 'Female') & (data['Height'] > 190)]  # Returns all the female entries with height greater than 190 cm

,Gender,Height,Weight,Index
3,Female,195,104,3
12,Female,192,101,3
32,Female,195,65,1
36,Female,197,114,3
60,Female,191,54,0
69,Female,194,136,4
75,Female,197,154,4
92,Female,194,111,3
103,Female,198,145,4
104,Female,192,140,4
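
Conditional selection works by building a boolean ```Series``` and passing it to the dataframe. As a sketch on made-up data (the names `is_female`, `is_tall`, and `tall_females` are hypothetical), the two conditions can be built separately and then combined:

```python
import pandas as pd

# Hypothetical rows, made up for illustration
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [174, 195, 185, 198],
})

is_female = df["Gender"] == "Female"   # boolean Series, one value per row
is_tall = df["Height"] > 190

# & is element-wise AND, | is element-wise OR, ~ negates
tall_females = df[is_female & is_tall]
print(tall_females)
```

The parentheses in the one-line form are needed because ```&``` binds more tightly than ```==``` and ```>``` in Python.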


### Sorting

In [18]:
data.sort_values('Height')

,Gender,Height,Weight,Index
297,Female,140,76,4
421,Male,140,146,5
49,Male,140,152,5
144,Male,140,79,5
17,Male,140,129,5
151,Male,140,52,3
251,Male,140,143,5
147,Female,140,146,5
190,Male,141,85,5
72,Male,141,80,5
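
By default ```sort_values``` sorts in ascending order and returns a new dataframe. The following sketch, on made-up data, shows descending order and sorting by more than one column; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical rows, made up for illustration
df = pd.DataFrame({
    "Height": [140, 195, 140, 170],
    "Weight": [76, 104, 146, 80],
})

# Tallest first
tallest_first = df.sort_values("Height", ascending=False)

# Height ascending, ties broken by Weight descending
by_two = df.sort_values(["Height", "Weight"], ascending=[True, False])

print(tallest_first)
print(by_two)
```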


### Groupby operations

In [25]:
data.groupby('Gender').mean()

Gender,Height,Weight,Index
Female,170.227451,105.698039,3.709804
Male,169.64898,106.314286,3.787755
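
A ```groupby``` splits the rows by the values of a column and then applies an aggregation to each group. As a sketch on made-up data, the example below aggregates a single column and then uses ```agg``` to compute a different statistic per column. The output column names `mean_height` and `max_weight` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical rows, made up for illustration
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [174, 195, 185, 198],
    "Weight": [96, 104, 110, 109],
})

# Mean of one column per group
mean_height = df.groupby("Gender")["Height"].mean()

# Different aggregations for different columns
summary = df.groupby("Gender").agg(
    mean_height=("Height", "mean"),
    max_weight=("Weight", "max"),
)

print(mean_height)
print(summary)
```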


### Query operations

In [26]:
data.query('Height >= 190')

,Gender,Height,Weight,Index
3,Female,195,104,3
10,Male,195,81,2
12,Female,192,101,3
14,Male,191,79,2
26,Male,190,95,3
...,...,...,...,...
420,Female,195,61,1
469,Male,198,109,3
473,Male,195,153,5
488,Male,198,136,4
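
A ```query``` string expresses the same kind of filter as conditional selection. The sketch below, on made-up data, checks that the two forms agree and shows how a Python variable can be referenced with ```@```. The names `min_height`, `via_query`, and `via_mask` are hypothetical.

```python
import pandas as pd

# Hypothetical rows, made up for illustration
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [174, 195, 185, 198],
})

min_height = 190

# @name refers to a Python variable in the surrounding scope
via_query = df.query("Height >= @min_height")
via_mask = df[df["Height"] >= min_height]
print(via_query.equals(via_mask))

# Conditions can be combined with and / or inside the query string
tall_females = df.query("Gender == 'Female' and Height > 190")
print(tall_females)
```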
