## Table of Contents

1. **[Pandas](#pandas)**

2. **[Data Structures](#structures)**
    
3. **[Pandas Series](#series)**
    - 3.1 - [Creating a Series](#creatingS)
    - 3.2 - [Manipulating Series](#manipulatingS)

4. **[Pandas Dataframes](#dataframes)**
    - 4.1 - [Creating Dataframes](#creatingDF)
    - 4.2 - [Manipulating Dataframes](#manipulatingDF)

5. **[Reading Data from Different Sources](#reading_data)**


<a id="pandas"> </a>
## 1. Pandas
#### Introduction to Pandas

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> Pandas contains data structures and data manipulation tools designed for data cleaning and analysis.
<br><br>
                        While Pandas adopts many coding idioms from NumPy, the biggest difference is that Pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.<br><br>
                         The name Pandas is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

**How to install pandas?**<br>
1. Install it from a notebook cell with<br>
`!pip install pandas`<br>
2. Import it with<br>
`import pandas as pd`

By convention, the Pandas library is imported under the alias `pd`.

In [1]:
import pandas as pd

From now on we will write `pd` instead of `pandas`.

<a id="structures"> </a>
## 2. Data Structures
#### Introduction to Data Structures

Pandas has two main data structures:<br>
1. A Series is a 1-dimensional labeled array that can hold data of any type (integer, string, boolean, float, Python objects, and so on). Its axis labels are collectively called the index.<br>
2. A DataFrame is a 2-dimensional labeled data structure whose columns can each hold a different data type.
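
As a rough illustration of the two structures, here is a minimal sketch. The names `prices` and `table` and their values are made up for this example.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "banana", "cherry"])

# A DataFrame: two-dimensional, each column may have its own dtype
table = pd.DataFrame({"fruit": ["apple", "banana"], "qty": [3, 5]})

print(prices.ndim, table.ndim)   # number of dimensions of each structure
print(list(prices.index))        # the axis labels of the Series
print(table.dtypes)              # one dtype per column
```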

<a id="series"> </a>
## 3. Pandas Series
#### Introduction to Pandas Series and Creating Series

A Pandas Series is a one-dimensional labeled array. Like a NumPy array or a column in a table, the whole Series has a single data type (dtype); when the values are of mixed types, they are stored under the general `object` dtype.<br><br>
Pandas assigns an index label to each item in the Series. By default, the labels run from 0 to N, where N is the length of the Series minus one.

<a id="creatingS"> </a>
### 3.1 Creating a Series

**1. To create a numeric series** 

In [4]:
# create a numeric series
numbers = range(1,100,5)
pd.Series(numbers)

0      1
1      6
2     11
3     16
4     21
5     26
6     31
7     36
8     41
9     46
10    51
11    56
12    61
13    66
14    71
15    76
16    81
17    86
18    91
19    96
dtype: int64

The output also gives the data type of the series as `int64`.

Note that, by default, each item receives an index label from 0 to N, where N is the length of the Series minus one.

<b>*In pandas, the row labels are called the 'index'.*</b>

**2. To create an object series** 

In [5]:
# create an object series
string = "Hi", "How", "are", "you", "?"
pd.Series(string)

0     Hi
1    How
2    are
3    you
4      ?
dtype: object

The output gives the data type of the series as `object`.

**3. To create a series by giving both numeric and string values** 

In [6]:
# create a Series with an arbitrary list
s = pd.Series([345, 'London', 34.5, -34.45, 'Happy Birthday'])
s

0               345
1            London
2              34.5
3            -34.45
4    Happy Birthday
dtype: object

Because the list mixes numbers and strings, the whole Series gets the dtype `object`; the numeric values are stored as generic Python objects.
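
A quick way to see what `object` means in practice is to inspect the elements themselves. The sketch below reuses the same list and only uses standard pandas and Python calls.

```python
import pandas as pd

s = pd.Series([345, 'London', 34.5, -34.45, 'Happy Birthday'])

# dtype is object because no single numeric dtype fits every element
print(s.dtype)

# each element keeps the Python type it was created with
element_types = [type(v).__name__ for v in s]
print(element_types)

# arithmetic still works on an individual numeric element
print(s[0] + 1)
```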

**4. To set index values for a series**

In [7]:
marks = [60, 89, 74, 86]

subject = ["Maths", "Science", "English" , "Social Science"]

pd.Series(marks, index = subject) 

Maths             60
Science           89
English           74
Social Science    86
dtype: int64

The index is set using the argument `index=`. The data type of the series is still numeric (`int64`).

**5. To create a series from a dictionary**

In [8]:
data = {'Maths': 60, 'Science': 89, 'English': 76, 'Social Science': 86}

pd.Series(data)

Maths             60
Science           89
English           76
Social Science    86
dtype: int64

On passing a dict, the index of the resulting Series holds the dict’s keys in the order in which they were inserted, as the output above shows.
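
The order of the index is easy to check directly. This is a small sketch assuming a recent pandas version; `sort_index()` is the call to use when label order is what you want.

```python
import pandas as pd

data = {'Maths': 60, 'Science': 89, 'English': 76, 'Social Science': 86}
marks = pd.Series(data)

# the index follows the order in which the keys were inserted
print(list(marks.index))

# sort_index() returns a new Series ordered by label
by_label = marks.sort_index()
print(list(by_label.index))
```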

**6. A series with missing values**

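If the `index` argument contains a label that is not a key of the dict, its value in the Series is `NaN`; dict keys that are absent from `index` (here 'English') are dropped. Because `NaN` is a float, the dtype changes to `float64`.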

In [9]:
subjects = ["Maths", "Science", "Art and Craft" , "Social Science"]

marks_series = pd.Series(data, index = subjects)

print(marks_series)

Maths             60.0
Science           89.0
Art and Craft      NaN
Social Science    86.0
dtype: float64


In [11]:
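# with a list, the lengths must match: 3 labels and 4 values raise a ValueError
#index = ['Apple', 'Banana', 'Orange']
#quantity = [34, 20, 30, 40]
#pd.Series(data = quantity, index = index)

# named 'values_by_label' rather than 'dict' so the built-in dict is not shadowed
values_by_label = {'A':30, 'B':40, 'C':50}
index = ['A', 'B', 'D']
pd.Series(data = values_by_label, index = index)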

A    30.0
B    40.0
D     NaN
dtype: float64

In [13]:
s1 = pd.Series([1, 2, 5, 6.5])
s2 = pd.Series(['first', 35, 'college', 62.5])
print(s1)
print(s2)

0    1.0
1    2.0
2    5.0
3    6.5
dtype: float64
0      first
1         35
2    college
3       62.5
dtype: object


<a id="manipulatingS"> </a>
### 3.2 Manipulating Series 
#### Manipulating series

**1. To check for null values using `.isnull`**

In [14]:
marks_series.isnull()

Maths             False
Science           False
Art and Craft      True
Social Science    False
dtype: bool

`False` indicates that the value is not null.

**2. To check for non-null values using `.notnull`**

In [15]:
marks_series.notnull()

Maths              True
Science            True
Art and Craft     False
Social Science     True
dtype: bool

`True` indicates that the value is not null.
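
The boolean masks returned by `.isnull()` and `.notnull()` are usually a first step. The sketch below shows three common follow-ups on a Series shaped like `marks_series`; filling with 0 is an arbitrary choice made only for illustration.

```python
import numpy as np
import pandas as pd

marks = pd.Series([60.0, 89.0, np.nan, 86.0],
                  index=["Maths", "Science", "Art and Craft", "Social Science"])

# True counts as 1, so the sum of the mask is the number of missing values
n_missing = marks.isnull().sum()

# keep only the entries that are present
present = marks[marks.notnull()]

# or replace missing entries with a placeholder value
filled = marks.fillna(0)

print(n_missing, len(present), filled["Art and Craft"])
```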

**3. To find the subjects in which the marks are more than 75**

In [16]:
marks_series[marks_series > 75]

Science           89.0
Social Science    86.0
dtype: float64

**4. To assign 68 marks to 'Art and Craft'**

In [17]:
marks_series["Art and Craft"] = 68

In [18]:
marks_series

Maths             60.0
Science           89.0
Art and Craft     68.0
Social Science    86.0
dtype: float64

**5. To check whether Maths marks are 73**

In [19]:
marks_series.Maths == 73

False

In [20]:
# or you may use

marks_series["Maths"] == 73

False

**6. Sorting a numeric series**

In [21]:
# create a pandas series
import numpy as np
values = pd.Series([23, 45, np.nan, 41, 23, 34, 55, np.nan, 34, 20])
values

0    23.0
1    45.0
2     NaN
3    41.0
4    23.0
5    34.0
6    55.0
7     NaN
8    34.0
9    20.0
dtype: float64

In [22]:
# ascending order
values.sort_values(ascending = True)

9    20.0
0    23.0
4    23.0
5    34.0
8    34.0
3    41.0
1    45.0
6    55.0
2     NaN
7     NaN
dtype: float64

In [23]:
# descending order
values.sort_values(ascending = False)

6    55.0
1    45.0
3    41.0
8    34.0
5    34.0
4    23.0
0    23.0
9    20.0
2     NaN
7     NaN
dtype: float64

**7. Sorting a string series**

In [24]:
# create a pandas series
string_values = pd.Series(["a", "j", "d", "f", "t", "a"])

string_values

0    a
1    j
2    d
3    f
4    t
5    a
dtype: object

In [25]:
# ascending order
string_values.sort_values(ascending = True)

0    a
5    a
2    d
3    f
1    j
4    t
dtype: object

In [26]:
# descending order
string_values.sort_values(ascending = False)

4    t
1    j
3    f
2    d
5    a
0    a
dtype: object

**8. Rank a Series**

In [27]:
# recall the marks_series
marks_series.rank( ascending=True, pct=False)

Maths             1.0
Science           4.0
Art and Craft     2.0
Social Science    3.0
dtype: float64

In [28]:
data = [0.85, 0.8, 0.98, 0.74, 0.4, 0.55, 0.94, 0.42, 0.43, 0.92]
ser = pd.Series(data=data) 
ser.sort_values(ascending = False)

2    0.98
6    0.94
9    0.92
0    0.85
1    0.80
3    0.74
5    0.55
8    0.43
7    0.42
4    0.40
dtype: float64

In [29]:
data = range(10)
new_ser = pd.Series(data = data)
new_ser[new_ser == 5]

5    5
dtype: int64

<a id="dataframes"> </a>
## 4. Pandas Dataframes
#### Introduction to Dataframes and Creating Dataframes

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> A Dataframe is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (numeric, string, boolean, and so on). <br><br>
                        The Dataframe has both a row and column index; it can be thought of as a dict of Series all sharing the same index. In a dataframe, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. 
<br><br>
                        While a Dataframe is physically two-dimensional, it can be used to represent higher dimensional data in a tabular format using hierarchical indexing.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
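
The two ideas above, columns as Series sharing one index and hierarchical indexing, can be sketched as follows. The frames `df` and `sales` and their values are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Oslo"], "temp_c": [31.5, 4.0]},
                  index=["a", "b"])

# each column is a Series, and all columns share the DataFrame's row index
col = df["temp_c"]
print(type(col).__name__, col.index.equals(df.index))

# hierarchical (multi-level) row index: year and quarter as two levels
sales = pd.DataFrame(
    {"units": [10, 12, 9, 14]},
    index=pd.MultiIndex.from_product([[2023, 2024], ["Q1", "Q2"]],
                                     names=["year", "quarter"]),
)
one_year = sales.loc[2024]                  # all rows for one year
one_cell = sales.loc[(2023, "Q2"), "units"]  # a single value
print(one_year)
print(one_cell)
```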

<a id="creatingDF"> </a>
### 4.1 Creating Dataframes

**1. Creating a dataframe from a dictionary**

In [33]:
data = {'Subject': ['Maths', 'History', 'Science', 'English', 'Geography', 'Art'],
        'Marks': (45, 65, 78, 65, 80, 78),
        'CGPA': [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]}

df = pd.DataFrame(data)
print(df)

     Subject  Marks  CGPA
0      Maths     45   2.5
1    History     65   3.0
2    Science     78   3.5
3    English     65   2.0
4  Geography     80   4.0
5        Art     78   4.0


**Note:** As with a Series, the resulting Dataframe is assigned an index automatically. The 'Marks' values were passed as a tuple; a column can be built from a tuple just as from a list.

**2. To create dataframe from series**

In [34]:
Subject = pd.Series(['Maths', 'History', 'Science', 'English', 'Geography', 'Art'])
Marks = pd.Series([45, 65, 78, 65, 80, 78])
CGPA = pd.Series([2.5, 3.0, 3.5, 2.0, 4.0, 4.0])

In [35]:
pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA'])

Unnamed: 0,0,1,2,3,4,5
Subject,Maths,History,Science,English,Geography,Art
Marks,45,65,78,65,80,78
CGPA,2.5,3,3.5,2,4,4


Each Series becomes a row here. To get one column per Series instead, we use `.T`, which stands for transpose.

In [36]:
pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA']).T

Unnamed: 0,Subject,Marks,CGPA
0,Maths,45,2.5
1,History,65,3.0
2,Science,78,3.5
3,English,65,2.0
4,Geography,80,4.0
5,Art,78,4.0


**Remark:** Assign a name to the dataframe and then use `.T` to transpose it.
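
One side effect of building row-wise and then transposing is worth checking: each column ends up with the general `object` dtype. The sketch below, using shortened versions of the same Series, compares this with building the frame column-wise from a dict of Series, which is an alternative not used in the cells above.

```python
import pandas as pd

Subject = pd.Series(['Maths', 'History', 'Science'])
Marks = pd.Series([45, 65, 78])
CGPA = pd.Series([2.5, 3.0, 3.5])

# row-wise construction followed by .T: every column ends up as object
transposed = pd.DataFrame([Subject, Marks, CGPA],
                          index=['Subject', 'Marks', 'CGPA']).T

# column-wise construction from a dict of Series keeps each dtype
by_column = pd.DataFrame({'Subject': Subject, 'Marks': Marks, 'CGPA': CGPA})

print(transposed.dtypes)
print(by_column.dtypes)
```

If numeric operations on 'Marks' or 'CGPA' are planned, the column-wise form avoids a later type conversion.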

**4. To create dataframe from lists**

In [37]:
Subject = ['Maths', 'History', 'Science', 'English', 'Geography', 'Art']
Marks = [45, 65, 78, 65, 80, 78]
CGPA = [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]

In [38]:
pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA']).T

Unnamed: 0,Subject,Marks,CGPA
0,Maths,45,2.5
1,History,65,3.0
2,Science,78,3.5
3,English,65,2.0
4,Geography,80,4.0
5,Art,78,4.0


**5. To read data from csv file**

In [39]:
data = pd.read_csv("ex1.csv")

In [40]:
type(data)

pandas.core.frame.DataFrame

On checking the data type, we notice it is read as a pandas data frame.

In [41]:
print(data)

    Age  Weight (in kg)  Height (in m)
0    45              60           1.35
1    12              43           1.21
2    54              78           1.50
3    26              65           1.21
4    68              50           1.32
5    21              43           1.52
6    10              32           1.65
7    57              34           1.61
8    75              23           1.24
9    32              21           1.52
10   23              53           1.50
11   34              65           1.76
12   55              89           1.65
13   23              45           1.75
14   56              76           1.69
15   67              78           1.85
16   26              65           1.21
17   56              74           1.69
18   67              78           1.85
19   26              65           1.21
20   68              50           1.32
21   56              76           1.69
22   67              78           1.85


**6. To print head of the data**

In [42]:
data.head()

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45,60,1.35
1,12,43,1.21
2,54,78,1.5
3,26,65,1.21
4,68,50,1.32


By default, `.head()` displays the **first** five rows. However, we can set the number of rows to be displayed.

Say we want to see the first 9 rows; we write the number 9 in the parentheses.

In [43]:
data.head(9)

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45,60,1.35
1,12,43,1.21
2,54,78,1.5
3,26,65,1.21
4,68,50,1.32
5,21,43,1.52
6,10,32,1.65
7,57,34,1.61
8,75,23,1.24


**7. To print tail of the data**

In [44]:
data.tail()

Unnamed: 0,Age,Weight (in kg),Height (in m)
18,67,78,1.85
19,26,65,1.21
20,68,50,1.32
21,56,76,1.69
22,67,78,1.85


By default, `.tail()` displays the **last** five rows. However, we can set the number of rows to be displayed.

If we want to see the last 14 rows, we write the number 14 in the parentheses.

In [45]:
data.tail(14)

Unnamed: 0,Age,Weight (in kg),Height (in m)
9,32,21,1.52
10,23,53,1.5
11,34,65,1.76
12,55,89,1.65
13,23,45,1.75
14,56,76,1.69
15,67,78,1.85
16,26,65,1.21
17,56,74,1.69
18,67,78,1.85
19,26,65,1.21
20,68,50,1.32
21,56,76,1.69
22,67,78,1.85


**8. To obtain the dimensions (rows, columns) of the data**

In [46]:
data.shape

(23, 3)

**9. To know the data types of a dataframe**

In [47]:
data.dtypes

Age                 int64
Weight (in kg)      int64
Height (in m)     float64
dtype: object

We can see the data type of each variable in the dataframe.

**10. To get a summary of the data**

In [48]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             23 non-null     int64  
 1   Weight (in kg)  23 non-null     int64  
 2   Height (in m)   23 non-null     float64
dtypes: float64(1), int64(2)
memory usage: 680.0 bytes


The output first gives the number of rows: `RangeIndex: 23 entries, 0 to 22` means there are 23 rows, numbered from 0 to 22. `Data columns (total 3 columns)` gives the number of columns.

The line `Age 23 non-null int64` indicates that the column named 'Age' has 23 non-null observations of data type 'int64'.

Finally, the memory used to store this dataframe is 680 bytes.

**11. To check the data type of a column in the data frame**

In [49]:
type(data.Age)

pandas.core.series.Series

In [50]:
type(data["Weight (in kg)"])

pandas.core.series.Series

In [51]:
type(data["Height (in m)"])

pandas.core.series.Series

In [54]:
data.Age.dtype

dtype('int64')

In [59]:
a1 = ['Hogwarts', 'Durmstrang', 'Beauxbatons']
a2 = ['Hogwarts', 'Durmstrang', 'Beauxbatons']
a3 = ['Hogwarts', 'Durmstrang', 'Beauxbatons']
school = [a1, a2, a3]
inst = ['School_1', 'School_2', 'School_3']
Muggle_data = pd.DataFrame(data = school, columns = inst)
Muggle_data

Unnamed: 0,School_1,School_2,School_3
0,Hogwarts,Durmstrang,Beauxbatons
1,Hogwarts,Durmstrang,Beauxbatons
2,Hogwarts,Durmstrang,Beauxbatons


**Note that every column of the dataframe is a pandas Series.**

<a id="manipulatingDF"> </a>
### 4.2  Manipulating Dataframes 
#### Manipulating the Dataframes

### Add new column and rows

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> CAUTION:<br>
                        1. DataFrame[column] works for any column name, but DataFrame.column only works when the column name is a valid Python variable name.<br>
                        2. New columns cannot be created with the ` data.BMI ` syntax.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
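
Both points of the caution can be checked on a small frame. This is a minimal sketch with made-up values; the warning filter is there only because pandas emits a `UserWarning` on the attribute assignment.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'Weight (in kg)': [60, 43, 78]})

# bracket access works for any column name, including ones with spaces
print(df['Weight (in kg)'].tolist())

# attribute access works here only because 'A' is a valid Python identifier
print(df.A.tolist())

# attribute assignment sets a plain Python attribute, not a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.C = df.A * 2
created_by_attribute = 'C' in df.columns

# bracket assignment is what actually creates the column
df['C'] = df.A * 2
created_by_brackets = 'C' in df.columns

print(created_by_attribute, created_by_brackets)
```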

**1. Adding a new column to the data set**

In [60]:
data["BMI"] = data["Weight (in kg)"] / data["Height (in m)"]**2

In [61]:
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45,60,1.35,32.921811
1,12,43,1.21,29.369579
2,54,78,1.5,34.666667
3,26,65,1.21,44.395875
4,68,50,1.32,28.696051
5,21,43,1.52,18.611496
6,10,32,1.65,11.753903
7,57,34,1.61,13.116778
8,75,23,1.24,14.958377
9,32,21,1.52,9.089335


In [62]:
data.shape

(23, 4)

**2. Adding a new row to the data set**

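A new row can be added by assigning to `.loc` with a new index label. Here we first make a copy with `copy()` so that the original `data` stays unchanged.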

In [63]:
data_copy = data.copy()
data_copy.loc[23] = [45, 85, 1.8, 26.3]

In [65]:
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45,60,1.35,32.921811
1,12,43,1.21,29.369579
2,54,78,1.5,34.666667
3,26,65,1.21,44.395875
4,68,50,1.32,28.696051
5,21,43,1.52,18.611496
6,10,32,1.65,11.753903
7,57,34,1.61,13.116778
8,75,23,1.24,14.958377
9,32,21,1.52,9.089335


In [66]:
data_copy

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45.0,60.0,1.35,32.921811
1,12.0,43.0,1.21,29.369579
2,54.0,78.0,1.5,34.666667
3,26.0,65.0,1.21,44.395875
4,68.0,50.0,1.32,28.696051
5,21.0,43.0,1.52,18.611496
6,10.0,32.0,1.65,11.753903
7,57.0,34.0,1.61,13.116778
8,75.0,23.0,1.24,14.958377
9,32.0,21.0,1.52,9.089335


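A new row with index label 23 has been added to `data_copy` (it lies beyond the rows displayed above), while `data` is unchanged. Note that the integer columns of `data_copy` are now displayed as floats.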
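
As a point of comparison, the sketch below appends a row in two ways on a small made-up frame: with `.loc`, as above, and with `pd.concat`, an alternative that is not used in the cells above and that keeps each column's dtype.

```python
import pandas as pd

df = pd.DataFrame({'Age': [45, 12], 'Height (in m)': [1.35, 1.21]})

# .loc with a new label appends a row in place; a mixed int/float list
# may upcast integer columns to float
with_loc = df.copy()
with_loc.loc[len(with_loc)] = [30, 1.80]

# pd.concat builds a new frame and keeps each column's dtype
new_row = pd.DataFrame({'Age': [30], 'Height (in m)': [1.80]})
with_concat = pd.concat([df, new_row], ignore_index=True)

print(with_loc.dtypes)
print(with_concat.dtypes)
```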

**3. Indexing a dataframe using `.iloc`**

`DataFrame.iloc[]` selects rows and columns by integer position (0, 1, 2, …), regardless of the index labels. It is useful when the index labels are something other than 0, 1, 2, 3….n or when the user doesn’t know them.

We shall work on the BMI data set.

#### Select the row at position 2 (the third row)

In [67]:
data.iloc[2]

Age               54.000000
Weight (in kg)    78.000000
Height (in m)      1.500000
BMI               34.666667
Name: 2, dtype: float64

In [68]:
data.loc[2]

Age               54.000000
Weight (in kg)    78.000000
Height (in m)      1.500000
BMI               34.666667
Name: 2, dtype: float64

#### Select the rows at positions 4, 7 and 10

In [69]:
data.iloc[[4,7,10]]

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
4,68,50,1.32,28.696051
7,57,34,1.61,13.116778
10,23,53,1.5,23.555556


We use two pairs of square brackets since we are passing a list of row positions to be accessed.

#### Select the rows at positions 12 to 16

In [70]:
data.iloc[12:17]

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
12,55,89,1.65,32.690542
13,23,45,1.75,14.693878
14,56,76,1.69,26.609713
15,67,78,1.85,22.790358
16,26,65,1.21,44.395875


#### Select the column at position 1 (the second column)

In [71]:
data.iloc[:, 1]

0     60
1     43
2     78
3     65
4     50
5     43
6     32
7     34
8     23
9     21
10    53
11    65
12    89
13    45
14    76
15    78
16    65
17    74
18    78
19    65
20    50
21    76
22    78
Name: Weight (in kg), dtype: int64

#### Select the last column

In [72]:
data.iloc[:,-1]

0     32.921811
1     29.369579
2     34.666667
3     44.395875
4     28.696051
5     18.611496
6     11.753903
7     13.116778
8     14.958377
9      9.089335
10    23.555556
11    20.983988
12    32.690542
13    14.693878
14    26.609713
15    22.790358
16    44.395875
17    25.909457
18    22.790358
19    44.395875
20    28.696051
21    26.609713
22    22.790358
Name: BMI, dtype: float64

To select the last column we use -1, to select the second last column we use -2.

#### Select the first two columns

In [73]:
data.iloc[:,0:2]

Unnamed: 0,Age,Weight (in kg)
0,45,60
1,12,43
2,54,78
3,26,65
4,68,50
5,21,43
6,10,32
7,57,34
8,75,23
9,32,21


#### Select the first two columns and the rows at positions 5 to 10

In [74]:
data.iloc[5:11, 0:2]

Unnamed: 0,Age,Weight (in kg)
5,21,43
6,10,32
7,57,34
8,75,23
9,32,21
10,23,53


**4. Indexing a dataframe using `.loc`**

`DataFrame.loc[]` selects by index label: it returns the matching row as a Series, or several rows as a dataframe, and raises a `KeyError` when a requested label does not exist. <br>
`DataFrame.loc[row_labels, column_labels]` is used to select rows and columns based on their names.

#### Select the rows labeled 1 to 5 and the columns 'Weight (in kg)' and 'BMI'

In [75]:
data.loc[1:5,["Weight (in kg)","BMI"]]

Unnamed: 0,Weight (in kg),BMI
1,43,29.369579
2,78,34.666667
3,65,44.395875
4,50,28.696051
5,43,18.611496


**Note:** The row labels here are numbers. Unlike `.iloc`, a `.loc` slice includes its end label, so `1:5` returns five rows.
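
The difference between the two indexers is easier to see when the index labels are not numbers. A minimal sketch, with a made-up frame indexed by letters:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# .iloc slices by position and excludes the end position
by_position = df.iloc[1:3]

# .loc slices by label and includes the end label
by_label = df.loc['b':'d']

print(by_position.index.tolist())
print(by_label.index.tolist())
```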

**5. Selecting columns by specifying column names**

#### Select the column 'Age'

In [76]:
data.Age

0     45
1     12
2     54
3     26
4     68
5     21
6     10
7     57
8     75
9     32
10    23
11    34
12    55
13    23
14    56
15    67
16    26
17    56
18    67
19    26
20    68
21    56
22    67
Name: Age, dtype: int64

**Remark:** Using this method we can select only one column.

In [77]:
# OR
data["Age"]

0     45
1     12
2     54
3     26
4     68
5     21
6     10
7     57
8     75
9     32
10    23
11    34
12    55
13    23
14    56
15    67
16    26
17    56
18    67
19    26
20    68
21    56
22    67
Name: Age, dtype: int64

#### Select the column 'Age' and 'BMI'

In [78]:
data[["Age","BMI"]]

Unnamed: 0,Age,BMI
0,45,32.921811
1,12,29.369579
2,54,34.666667
3,26,44.395875
4,68,28.696051
5,21,18.611496
6,10,11.753903
7,57,13.116778
8,75,14.958377
9,32,9.089335


**6. Sort the data frame on the basis of values in a column**

Each column of a pandas DataFrame is treated as a pandas Series. `DataFrame.sort_values()` works similarly to `pandas.Series.sort_values()`, except that we also name the column to sort by.

In [79]:
# print head() of 'data'
data.head()

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45,60,1.35,32.921811
1,12,43,1.21,29.369579
2,54,78,1.5,34.666667
3,26,65,1.21,44.395875
4,68,50,1.32,28.696051


In [80]:
# sort the data frame on basis of 'Age' values
# by default the values will get sorted in ascending order
data.sort_values('Age')

#Note: 'ascending = False' will sort the data frame in descending order

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
6,10,32,1.65,11.753903
1,12,43,1.21,29.369579
5,21,43,1.52,18.611496
13,23,45,1.75,14.693878
10,23,53,1.5,23.555556
19,26,65,1.21,44.395875
3,26,65,1.21,44.395875
16,26,65,1.21,44.395875
9,32,21,1.52,9.089335
11,34,65,1.76,20.983988
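
Sorting on more than one column follows the same pattern. The sketch below uses a small made-up frame; passing lists to `by` and `ascending` lets each column have its own direction.

```python
import pandas as pd

people = pd.DataFrame({'Age': [26, 23, 26, 23],
                       'Weight': [65, 53, 60, 45]})

# sort by Age descending, then break ties by Weight ascending
ordered = people.sort_values(['Age', 'Weight'], ascending=[False, True])

# sort_values returns a new frame; the original order is unchanged
print(ordered)
print(people.index.tolist())
```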


**7. Rank the dataframe**

In [81]:
# rank the data frame 'data' in descending order based on 'BMI'
# method = 'min' gives tied 'BMI' values the lowest rank of their group
data['BMI_ranked'] = data['BMI'].rank(ascending = False, method = 'min')
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI,BMI_ranked
0,45,60,1.35,32.921811,5.0
1,12,43,1.21,29.369579,7.0
2,54,78,1.5,34.666667,4.0
3,26,65,1.21,44.395875,1.0
4,68,50,1.32,28.696051,8.0
5,21,43,1.52,18.611496,18.0
6,10,32,1.65,11.753903,22.0
7,57,34,1.61,13.116778,21.0
8,75,23,1.24,14.958377,19.0
9,32,21,1.52,9.089335,23.0


In the full data frame, 'BMI = 44.395875' occurs three times (rows 3, 16 and 19); thus method = 'min' assigns the minimum rank (=1) to all three values. The next largest BMI (34.666667) gets rank 4, and so on. Thus, there is no rank equal to 2 or 3.

In [82]:
# method = 'dense' assigns same rank to all the same BMI values
data['BMI_densed_rank'] = data['BMI'].rank(method = 'dense')
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI,BMI_ranked,BMI_densed_rank
0,45,60,1.35,32.921811,5.0,15.0
1,12,43,1.21,29.369579,7.0,13.0
2,54,78,1.5,34.666667,4.0,16.0
3,26,65,1.21,44.395875,1.0,17.0
4,68,50,1.32,28.696051,8.0,12.0
5,21,43,1.52,18.611496,18.0,6.0
6,10,32,1.65,11.753903,22.0,2.0
7,57,34,1.61,13.116778,21.0,3.0
8,75,23,1.24,14.958377,19.0,5.0
9,32,21,1.52,9.089335,23.0,1.0


Here the ranking is ascending (the default), so the dense method assigns rank 1 to the minimum BMI value (9.089335), rank 2 to the next distinct value, and so on. Tied values share a rank, and no rank is skipped in the dense method.
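
The ranking methods are easiest to compare side by side on a short Series with one tie. The values below are made up for the sketch; `'average'` is the default method of `rank()`.

```python
import pandas as pd

bmi = pd.Series([44.4, 34.7, 44.4, 28.7])

ranks = pd.DataFrame({
    'bmi': bmi,
    # ties share the lowest rank of the group; following ranks are skipped
    'min': bmi.rank(ascending=False, method='min'),
    # ties share a rank and the next distinct value gets the next integer
    'dense': bmi.rank(ascending=False, method='dense'),
    # default: ties get the mean of the ranks they span
    'average': bmi.rank(ascending=False),
})
print(ranks)
```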

**8. To check for missing values**

We shall import a new dataset.

In [83]:
missing_data = pd.read_csv("example_missingdata.csv")
missing_data

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45.0,60.0,1.35
1,12.0,43.0,1.21
2,54.0,78.0,1.5
3,26.0,65.0,1.21
4,68.0,50.0,1.32
5,21.0,,1.52
6,10.0,32.0,1.65
7,57.0,34.0,1.61
8,75.0,23.0,1.24
9,32.0,21.0,1.52


In [84]:
missing_data.isnull().sum()

Age               1
Weight (in kg)    2
Height (in m)     1
dtype: int64

The function `.isnull()` checks whether each value is missing. The `sum()` function then counts the 'True' values in each column, so the final output gives the number of missing values per column.

Here, we can see there are 2 missing values in the 'Weight (in kg)' column and one missing value in each of the other columns.
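
The same mask supports a few other common checks. A minimal sketch on a made-up frame (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Age': [45, np.nan, 54],
                      'Weight': [60, np.nan, np.nan]})

per_column = frame.isnull().sum()                   # missing values in each column
total = int(per_column.sum())                       # missing values in the whole frame
rows_with_gap = frame[frame.isnull().any(axis=1)]   # rows with at least one missing value

print(per_column)
print(total, len(rows_with_gap))
```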

In [85]:
data = {'A':[1,2,3,4,5], 'B':[1,0,1,1,0]}
df = pd.DataFrame(data = data)
#df.C = df.A + df.B   # would NOT add a column 'C'; use df['C'] = df.A + df.B instead



In [87]:
calorie_data = pd.DataFrame({'day': ['day1','day2','day3','day4','day5']
                           ,'calories': [450, 300, 345, 520, 600]
                           ,'duration_min': [30, 25, 29, 39, 48]
                           })
calorie_data
calorie_data.loc[0]

day             day1
calories         450
duration_min      30
Name: 0, dtype: object

In [91]:
s1 = pd.Series( range(0,5) )
s2 = pd.Series( range(5,10))
s3 = pd.Series( range(10,15))
s4 = pd.Series( range(15,20))
thedf = pd.DataFrame( [s1,s2,s3,s4] )
print(thedf)

    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19


In [101]:
# reverse rows
# thedf.iloc[::-1,:] # works
# thedf.loc[thedf.index[::-1],:] # works
thedf.iloc[thedf.index[::-1],:] # works

Unnamed: 0,0,1,2,3,4
3,15,16,17,18,19
2,10,11,12,13,14
1,5,6,7,8,9
0,0,1,2,3,4


In [96]:
calorie_data = pd.DataFrame({'day': ['day1','day2','day3','day4','day5']
                           ,'calories': [450, 300, 345, 520, 600]
                           ,'duration_min': [30, 25, 29, 39, 48]
                           })
calorie_data

Unnamed: 0,day,calories,duration_min
0,day1,450,30
1,day2,300,25
2,day3,345,29
3,day4,520,39
4,day5,600,48


In [97]:
calorie_data.loc[[1,2],['calories']]

Unnamed: 0,calories
1,300
2,345


In [98]:
calorie_data.iloc[[1,2],[1]]

Unnamed: 0,calories
1,300
2,345


<a id="reading_data"> </a>
## 5.  Reading Data from Different Sources
#### Reading Data From Different Sources
Note: The file names are used as examples only. You can try importing your own files to run the examples below.

**1. Read a `.xlsx` file**

In [102]:
pd.read_excel('ex1.xlsx')

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.

The call fails here because an optional dependency for reading Excel files is not installed in this environment. Installing the package named in the error message and re-running the cell resolves it.

**2. Read a `.zip` file**

In [104]:
import zipfile
with zipfile.ZipFile('data.zip') as z:
    with z.open('example.csv') as f:
        file = pd.read_csv(f)
        print(file.head())

   Age  Weight (in kg)  Height (in m)
0   45              60           1.35
1   12              43           1.21
2   54              78           1.50
3   26              65           1.21
4   68              50           1.32
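
When the archive holds a single CSV file, `pd.read_csv` can also read the `.zip` directly. The sketch below builds a throwaway archive first so that it runs on its own; the file names and contents are made up for the example.

```python
import os
import tempfile
import zipfile

import pandas as pd

# build a small archive so the example is self-contained
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, 'data.zip')
with zipfile.ZipFile(zip_path, 'w') as z:
    z.writestr('example.csv', 'Age,Weight\n45,60\n12,43\n')

# read_csv infers zip compression from the '.zip' extension;
# this shortcut requires the archive to contain exactly one file
from_zip = pd.read_csv(zip_path)
print(from_zip)
```

For archives with several files, opening the member explicitly with `zipfile`, as in the cell above, remains the way to go.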


**3. Read a `.html` file**

# read_html returns a list of DataFrames (one per table found); take the first
df = pd.read_html('Sheet1.html', header=1, index_col=0)[0]

**4. Read a `.txt` file**

In [106]:
data = pd.read_csv('ex1.txt', sep="\t")
data.head()

Unnamed: 0,Country,Birth rate,Life expectancy
0,Vietnam,1.822,74.828244
1,Vanuatu,3.869,70.819488
2,Tonga,3.911,72.150659
3,Timor-Leste,5.578,61.999854
4,Thailand,1.579,73.927659


**5. Read a `.json` file**

In [107]:
pd.read_json('iris.json')

Unnamed: 0,sepalLength,sepalWidth,petalLength,petalWidth,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
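
To try `read_json` without a file on disk, JSON text can be wrapped in `io.StringIO`. The sketch below assumes the data is a list of records, one object per row; the actual layout of `iris.json` is not shown here, so the two records are made up.

```python
import io
import pandas as pd

# two hypothetical records, one JSON object per row
text = ('[{"sepalLength": 5.1, "species": "setosa"},'
        ' {"sepalLength": 6.7, "species": "virginica"}]')

# StringIO lets read_json treat the text like a file;
# orient='records' states the layout explicitly
iris_small = pd.read_json(io.StringIO(text), orient='records')
print(iris_small)
```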


**6. Read a `.xml` file**