---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.12(Pandas-03)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## _Overview of Pandas Dataframe Data Structure_

#### Read about Pandas Data Structures: https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro

## Learning agenda of this notebook
1. Anatomy of a Dataframe
2. Creating Dataframe
    - An empty dataframe
    - Two-Dimensional NumPy Array
    - Dictionary of Python Lists
    - Dictionary of Pandas Series
3. Attributes of a Dataframe
4. Bonus

In [1]:
# To install this library in Jupyter notebook
#import sys
#!{sys.executable} -m pip install pandas

In [2]:
import pandas as pd
pd.__version__ , pd.__path__

('1.4.2', ['/home/dell/.local/lib/python3.8/site-packages/pandas'])

<img align="right" width="500" height="500"  src="images/dataframe.webp">


## 2. Creating a Dataframe
<br><br>
>**A Pandas Dataframe is a two-dimensional labeled data structure (like SQL table) with heterogeneously typed columns, having both a row and a column index.**

<br><br><br><br>

**```pd.DataFrame(data=None, index=None, columns=None, dtype=None)```**
- Where,
   - `data`: It can be a 2-D NumPy array, a dictionary of Python lists, or a dictionary of Pandas Series. (You can also create a dataframe from a file in CSV, Excel, JSON, or HTML format, or from a database table.)
   - `index`: These are the row indices. Defaults to RangeIndex (0, 1, 2, ..., n-1) if the `index` argument is not passed and the input data carries no indexing information.
   - `columns`: These are the column indices or labels. Defaults to RangeIndex (0, 1, 2, ..., n-1) if the `columns` argument is not passed and the input data carries no column labels.
   - `dtype`: Data type to force. Only a single dtype is allowed. If None, infer.
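
The defaults listed above can be checked with a short sketch. The array `raw` and the names `df_default` and `df_float` are made up for this illustration and are not part of the lecture code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 2 rows, 3 columns
raw = np.array([[1, 2, 3], [4, 5, 6]])

# Neither index nor columns is passed, so both default to a RangeIndex starting at 0
df_default = pd.DataFrame(data=raw)
print(df_default.index)    # RangeIndex(start=0, stop=2, step=1)
print(df_default.columns)  # RangeIndex(start=0, stop=3, step=1)

# dtype forces one single data type on every column
df_float = pd.DataFrame(data=raw, dtype="float64")
print(df_float.dtypes)
```

Passing `dtype` is optional; when it is left as `None`, Pandas infers a type for each column from the data.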

### a. Creating an Empty Dataframe

In [3]:
import pandas as pd
import numpy as np
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []
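
An empty dataframe is often only a starting point. As a small sketch (the name `df_start` and the sample ages are assumptions for illustration), the following checks the `shape` and `empty` attributes before and after a first column is assigned.

```python
import pandas as pd

# Start from an empty dataframe: no rows, no columns
df_start = pd.DataFrame()
print(df_start.shape)   # (0, 0)
print(df_start.empty)   # True

# Assigning a list to a new column label adds the column and its rows
df_start["age"] = [21, 20, 18]
print(df_start.shape)   # (3, 1)
print(df_start.empty)   # False
```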


### b. Creating a Dataframe from a 2-D NumPy Array

In [4]:
arr = np.random.randint(10,100, size= (6,5))
print("Numpy Array:\n",arr)



Numpy Array:
 [[62 84 41 28 50]
 [45 97 44 54 99]
 [58 74 98 33 94]
 [19 32 50 68 26]
 [46 64 33 31 69]
 [19 73 20 36 57]]


In [5]:
df = pd.DataFrame(data=arr)
print("Pandas Dataframe:\n")
df

Pandas Dataframe:



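| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 62 | 84 | 41 | 28 | 50 |
| 1 | 45 | 97 | 44 | 54 | 99 |
| 2 | 58 | 74 | 98 | 33 | 94 |
| 3 | 19 | 32 | 50 | 68 | 26 |
| 4 | 46 | 64 | 33 | 31 | 69 |
| 5 | 19 | 73 | 20 | 36 | 57 |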


- Note that both the row indices and the column labels/indices are implicitly set to numerical values from 0 to n-1, since neither of the two was provided while creating the dataframe object. They are not considered part of the data in the dataframe.
- In the majority of cases the row labels are left as default, i.e., 0, 1, 2, 3, ... The column labels, however, are usually changed from 0, 1, 2, 3, ... to meaningful names.

In [6]:
# Let us set column labels of our choice while creating the dataframe
col_labels=['Col1', 'Col2', 'Col3', 'Col4', 'Col5']
df = pd.DataFrame(data=arr, columns=col_labels)
df

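| | Col1 | Col2 | Col3 | Col4 | Col5 |
|---|---|---|---|---|---|
| 0 | 62 | 84 | 41 | 28 | 50 |
| 1 | 45 | 97 | 44 | 54 | 99 |
| 2 | 58 | 74 | 98 | 33 | 94 |
| 3 | 19 | 32 | 50 | 68 | 26 |
| 4 | 46 | 64 | 33 | 31 | 69 |
| 5 | 19 | 73 | 20 | 36 | 57 |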


In [7]:
# Let us set row labels of our choice while creating the dataframe
df = pd.DataFrame(data=arr, index=['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5'])
df

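| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Row0 | 62 | 84 | 41 | 28 | 50 |
| Row1 | 45 | 97 | 44 | 54 | 99 |
| Row2 | 58 | 74 | 98 | 33 | 94 |
| Row3 | 19 | 32 | 50 | 68 | 26 |
| Row4 | 46 | 64 | 33 | 31 | 69 |
| Row5 | 19 | 73 | 20 | 36 | 57 |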


In [8]:
# Let us set both the row labels and the column labels to strings of our choice while creating the dataframe
row_labels = ['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5']
col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']
df = pd.DataFrame(data=arr, index=row_labels, columns=col_labels)
df

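| | Col0 | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|---|
| Row0 | 62 | 84 | 41 | 28 | 50 |
| Row1 | 45 | 97 | 44 | 54 | 99 |
| Row2 | 58 | 74 | 98 | 33 | 94 |
| Row3 | 19 | 32 | 50 | 68 | 26 |
| Row4 | 46 | 64 | 33 | 31 | 69 |
| Row5 | 19 | 73 | 20 | 36 | 57 |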


- You can do this later as well, i.e., after the dataframe has been created with default indices.
- This is done by assigning a list of labels/values to `index` and `columns` attributes of a dataframe object.

In [9]:
arr = np.random.randint(10,100, size= (6,5))
df = pd.DataFrame(data=arr)
df

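| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 70 | 73 | 78 | 51 | 75 |
| 1 | 80 | 58 | 72 | 75 | 23 |
| 2 | 38 | 51 | 64 | 96 | 57 |
| 3 | 70 | 73 | 98 | 97 | 55 |
| 4 | 42 | 72 | 90 | 85 | 87 |
| 5 | 10 | 67 | 69 | 44 | 97 |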


In [10]:
row_labels = ['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5']
col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']

df.columns = col_labels
df.index = row_labels
df

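| | Col0 | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|---|
| Row0 | 70 | 73 | 78 | 51 | 75 |
| Row1 | 80 | 58 | 72 | 75 | 23 |
| Row2 | 38 | 51 | 64 | 96 | 57 |
| Row3 | 70 | 73 | 98 | 97 | 55 |
| Row4 | 42 | 72 | 90 | 85 | 87 |
| Row5 | 10 | 67 | 69 | 44 | 97 |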
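
One caveat when assigning to the `columns` or `index` attribute: the list must have exactly as many labels as the axis has entries. The sketch below uses a made-up dataframe `df_demo` and a hypothetical flag `rejected` to show that a list of the wrong length raises a `ValueError`.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with 2 rows and 3 columns
df_demo = pd.DataFrame(data=np.zeros((2, 3)))

# Correct: three labels for three columns
df_demo.columns = ["Col0", "Col1", "Col2"]

# Incorrect: two labels for three columns raises a ValueError
rejected = False
try:
    df_demo.columns = ["Col0", "Col1"]
except ValueError as err:
    rejected = True
    print("Assignment rejected:", err)

# The labels from the successful assignment are still in place
print(list(df_demo.columns))
```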


### c. Creating a Dataframe from a Dictionary of Python Lists
- You can create a dataframe object from a dictionary of Python Lists 
    - The dictionary `Keys` become the column names, and 
    - The dictionary `Values` are lists/arrays containing data for the respective columns.

In [11]:
people = {
    "name" : ["Ehtisham", "Ali", "Ayesha", "Dua", "Khubaib", "Adeen"],
    "age" : [21, 20, 18, 17, 12, 10],
    "address": ["Lahore", "Karachi", "Lahore", "Islamabad", "Kakul", "Karachi"],
    "cell" : ["321-123", "320-431", "321-478", "324-446", "321-967", "320-678"],
    "bg": ["B+", "A-", "B+", "O-", "A-", "B+"]
}
people

{'name': ['Ehtisham', 'Ali', 'Ayesha', 'Dua', 'Khubaib', 'Adeen'],
 'age': [21, 20, 18, 17, 12, 10],
 'address': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', 'Kakul', 'Karachi'],
 'cell': ['321-123', '320-431', '321-478', '324-446', '321-967', '320-678'],
 'bg': ['B+', 'A-', 'B+', 'O-', 'A-', 'B+']}

In [12]:
# Pass this dictionary of Python lists to pd.DataFrame()
df_people = pd.DataFrame(data=people)
df_people

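| | name | age | address | cell | bg |
|---|---|---|---|---|---|
| 0 | Ehtisham | 21 | Lahore | 321-123 | B+ |
| 1 | Ali | 20 | Karachi | 320-431 | A- |
| 2 | Ayesha | 18 | Lahore | 321-478 | B+ |
| 3 | Dua | 17 | Islamabad | 324-446 | O- |
| 4 | Khubaib | 12 | Kakul | 321-967 | A- |
| 5 | Adeen | 10 | Karachi | 320-678 | B+ |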


- Note that column labels are set as per the keys inside the dictionary object, while the row labels/indices are set to default numerical values.
- You can set the row indices while creating the dataframe by passing the `index` argument to `pd.DataFrame()`, or later by assigning new values to the `index` attribute of the dataframe object (the `columns` attribute works the same way for column labels).

In [13]:
# Let us change the row labels of above dataframe
row_labels = ['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06']
df_people.index = row_labels
df_people

| | name | age | address | cell | bg |
|---|---|---|---|---|---|
| MS01 | Ehtisham | 21 | Lahore | 321-123 | B+ |
| MS02 | Ali | 20 | Karachi | 320-431 | A- |
| MS03 | Ayesha | 18 | Lahore | 321-478 | B+ |
| MS04 | Dua | 17 | Islamabad | 324-446 | O- |
| MS05 | Khubaib | 12 | Kakul | 321-967 | A- |
| MS06 | Adeen | 10 | Karachi | 320-678 | B+ |


### d. Creating a Dataframe from a Dictionary of Pandas Series
One can think of a dataframe as a dictionary of Pandas Series: 
- `Keys` are column names, and 
- `Values` are Series objects for the respective columns.

In [14]:
# Use a name other than `dict`, so the Python built-in `dict` is not shadowed
series_dict = {
    "name": pd.Series(['Ehtisham', 'Ali', 'Ayesha']),
    "age": pd.Series([21, 19, 16]),
    "addr": pd.Series(['Lahore', 'Islamabad','Karachi']),
}
df = pd.DataFrame(data=series_dict)
df

| | name | age | addr |
|---|---|---|---|
| 0 | Ehtisham | 21 | Lahore |
| 1 | Ali | 19 | Islamabad |
| 2 | Ayesha | 16 | Karachi |


>Note from the above output that every series object becomes the data of the corresponding column, and the keys of the dictionary become the column labels.

In [15]:
dict1 = {
    "name": pd.Series(data=['Ehtisham', 'Ali', 'Ayesha', 'Dua'], index=['a','b','c', 'd']),
    "age": pd.Series(data=[21, 22,np.nan, 18], index=['a','b','c','d']),
    "addr": pd.Series(data=['Lahore', '', 'Peshawer','Karachi'], index=['a','b','c', 'd']),
}
df = pd.DataFrame(dict1)
df

| | name | age | addr |
|---|---|---|---|
| a | Ehtisham | 21.0 | Lahore |
| b | Ali | 22.0 | |
| c | Ayesha | NaN | Peshawer |
| d | Dua | 18.0 | Karachi |


>- In the above code and its output, note that every series object has four data values and four corresponding indices.
>- Also note that in the `age` series we have a NaN value, and in the `addr` series we have an empty string.
>- Another point to note is that the row indices of the three series match exactly, in number as well as in sequence/value.
>- A question arises: what if the indices of the series are different? See the following code to understand this concept.

In [16]:
dict1 = {
    "name": pd.Series(data=['Ehtisham', 'Ali', 'Ayesha', 'Dua'], index=['a','b','c', 'd']),
    "age": pd.Series(data=[21, 22,np.nan, 18], index=['a','x','y','d']),
    "addr": pd.Series(data=['Lahore', '','Karachi'], index=['a', 'd', 'x']),
}
df = pd.DataFrame(dict1)
df

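| | name | age | addr |
|---|---|---|---|
| a | Ehtisham | 21.0 | Lahore |
| b | Ali | NaN | NaN |
| c | Ayesha | NaN | NaN |
| d | Dua | 18.0 | |
| x | NaN | 22.0 | Karachi |
| y | NaN | NaN | NaN |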


>- In the above code and its output, note that the first series object has four data values and four corresponding indices. Similarly, the second series object has four data values (one of which is `np.nan`) and four corresponding indices, which differ partly from those of the first series. The third series has three data values (one of which is an empty string) and three indices.
>- Note that the resulting dataframe has six rows and three columns, because its row index is the union of the indices of the three series.
    - For index 'a' we have a value in all three series objects or columns.
    - For index 'b' we have a value in the first series object, and NaN in the second and third columns, since the second and third series objects have no value corresponding to row index 'b'.
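
The alignment behaviour described above can be reproduced on a smaller scale. The following is a minimal sketch with two made-up series (`s_name` and `s_age` are illustrative names, not part of the lecture code). It shows that the row index of the result is the union of the series indices and that missing positions are filled with NaN.

```python
import pandas as pd

# Two hypothetical series that share only the label 'a'
s_name = pd.Series(["Ehtisham", "Ali"], index=["a", "b"])
s_age = pd.Series([21, 22], index=["a", "x"])

df_aligned = pd.DataFrame({"name": s_name, "age": s_age})
print(df_aligned)

# isna() marks the cells that were filled in because a label was missing
print(df_aligned.isna())

# The integer ages were converted to floats so that NaN can be stored
print(df_aligned["age"].dtype)
```

Note the side effect on data types: a column of integers becomes `float64` as soon as alignment introduces a NaN, which is why the `age` column in the output above shows `21.0` rather than `21`.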

## 3. Attributes of Pandas Dataframe
- Like Series, we can access properties/attributes of a dataframe by using dot `.` notation

In [17]:
people = {
    "name" : ["Ehtisham", "Ali", "Ayesha", "Dua", "Khubaib", "Adeen"],
    "age" : [21, 20, 18, 17, 12, 10],
    "address": ["Lahore", "Karachi", "Lahore", "Islamabad", "Kakul", "Karachi"],
    "cell" : ["321-123", "320-431", "321-478", "324-446", "321-967", "320-678"],
    "bg": ["B+", "A-", "B+", "O-", "A-", "B+"]
}
# people

df_people = pd.DataFrame(data=people, index=['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06'])
df_people

| | name | age | address | cell | bg |
|---|---|---|---|---|---|
| MS01 | Ehtisham | 21 | Lahore | 321-123 | B+ |
| MS02 | Ali | 20 | Karachi | 320-431 | A- |
| MS03 | Ayesha | 18 | Lahore | 321-478 | B+ |
| MS04 | Dua | 17 | Islamabad | 324-446 | O- |
| MS05 | Khubaib | 12 | Kakul | 321-967 | A- |
| MS06 | Adeen | 10 | Karachi | 320-678 | B+ |


In [18]:
# `shape` attribute of a dataframe object returns a two-value tuple containing the number of rows and columns
# Note that the row count does not include the column labels, and the column count does not include the row index
df_people.shape 

(6, 5)

In [19]:
# `ndim` attribute of a dataframe object returns number of dimensions (which is always 2)
df_people.ndim

2

In [20]:
# `size` attribute of a dataframe object returns the number of elements in the underlying data
df_people.size

30
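
The three attributes above are related: `size` is the product of the two values in `shape`. A small sketch with a made-up dataframe (`df_small` is a hypothetical name) illustrates this, along with the built-in `len()`, which returns the row count only.

```python
import pandas as pd

# Hypothetical dataframe with 2 rows and 3 columns
df_small = pd.DataFrame({
    "name": ["Ali", "Dua"],
    "age": [20, 17],
    "bg": ["A-", "O-"],
})

n_rows, n_cols = df_small.shape
print(n_rows, n_cols)    # 2 3
print(df_small.size)     # 6, i.e. n_rows * n_cols
print(len(df_small))     # 2, the number of rows
```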

In [21]:
# `index` attribute of a dataframe object returns an Index object holding the row labels and their datatype
df_people.index

Index(['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06'], dtype='object')

In [22]:
# `columns` attribute of a dataframe object returns an Index object holding the column labels and their datatype
df_people.columns

Index(['name', 'age', 'address', 'cell', 'bg'], dtype='object')

In [23]:
# `axes` attribute of a dataframe object returns a list containing the row index and the column labels
df_people.axes

[Index(['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06'], dtype='object'),
 Index(['name', 'age', 'address', 'cell', 'bg'], dtype='object')]

In [24]:
# `values` attribute of a dataframe object returns a 2-D NumPy array having all the values in the dataframe,
# without the row indices and column labels
df_people.values

array([['Ehtisham', 21, 'Lahore', '321-123', 'B+'],
       ['Ali', 20, 'Karachi', '320-431', 'A-'],
       ['Ayesha', 18, 'Lahore', '321-478', 'B+'],
       ['Dua', 17, 'Islamabad', '324-446', 'O-'],
       ['Khubaib', 12, 'Kakul', '321-967', 'A-'],
       ['Adeen', 10, 'Karachi', '320-678', 'B+']], dtype=object)
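
The Pandas documentation recommends the `to_numpy()` method over the `values` attribute for this conversion. The sketch below uses a made-up dataframe `df_mixed` and shows why the array above has `dtype=object`: a NumPy array has a single dtype, so mixing text and numbers falls back to `object`, while a numeric-only selection keeps a numeric dtype.

```python
import pandas as pd

# Hypothetical dataframe with one text column and one integer column
df_mixed = pd.DataFrame({"name": ["Ali", "Dua"], "age": [20, 17]})

# Mixed column types: the array falls back to the generic object dtype
arr_mixed = df_mixed.to_numpy()
print(arr_mixed.dtype)   # object

# Selecting only the numeric column keeps an integer dtype
arr_age = df_mixed[["age"]].to_numpy()
print(arr_age.dtype)
print(arr_age.shape)     # (2, 1)
```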

In [25]:
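# `empty` attribute of a dataframe object returns True if the dataframe has no elements
# Note: `df` here is the dataframe created from `dict1` above, not `df_people`
df.empty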

False

In [26]:
# `dtypes` attribute of a dataframe object returns the data type of each column in the dataframe
df_people.dtypes

name       object
age         int64
address    object
cell       object
bg         object
dtype: object

In [27]:
# `count()` method returns the number of non-NA values in each column
df_people.count()

name       6
age        6
address    6
cell       6
bg         6
dtype: int64
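
Since `df_people` has no missing values, every count above equals the number of rows. As a small sketch with made-up data (`df_gaps` is a hypothetical name), the following shows how `count()` behaves once a NaN is present, and how `axis=1` counts across each row instead of down each column.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing age
df_gaps = pd.DataFrame({
    "name": ["Ali", "Dua", "Adeen"],
    "age": [20, np.nan, 10],
})

# Per column: the NaN in `age` is not counted
col_counts = df_gaps.count()
print(col_counts)        # name 3, age 2

# Per row: the second row has only one non-NA value
row_counts = df_gaps.count(axis=1)
print(row_counts)        # 2, 1, 2
```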

# Bonus

#### The `df.info()` Method

In [28]:
# This method prints information about a dataframe, including the row index range, column labels,
# non-null value count in each column, datatypes, and memory usage
df_people.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, MS01 to MS06
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     6 non-null      object
 1   age      6 non-null      int64 
 2   address  6 non-null      object
 3   cell     6 non-null      object
 4   bg       6 non-null      object
dtypes: int64(1), object(4)
memory usage: 288.0+ bytes


#### The `df.describe()` Method

In [29]:
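# This method returns descriptive statistics for the numeric columns of a dataframe
# Note: `df` here is the dataframe created from `dict1` above, whose `age` column has three non-NaN values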
df.describe()

| | age |
|---|---|
| count | 3.0 |
| mean | 20.333333 |
| std | 2.081666 |
| min | 18.0 |
| 25% | 19.5 |
| 50% | 21.0 |
| 75% | 21.5 |
| max | 22.0 |
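
By default, `describe()` skips non-numeric columns when a dataframe has numeric ones, which is why only `age` appears above. Passing `include="all"` is one way to summarise text columns as well. The sketch below uses a made-up dataframe (`df_demo` and `summary` are hypothetical names).

```python
import pandas as pd

# Hypothetical dataframe with one text column and one numeric column
df_demo = pd.DataFrame({
    "name": ["Ali", "Dua", "Ali"],
    "age": [20, 17, 22],
})

# include="all" adds rows such as `unique`, `top` and `freq` for text columns;
# cells that do not apply to a column are shown as NaN
summary = df_demo.describe(include="all")
print(summary)
```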


## Check Your Concepts:
- What is a DataFrame in Pandas?
- Make a Pandas DataFrame with two-dimensional list | Python 
- Python | Creating DataFrame from dict of ndarray/lists 
- Creating Pandas dataframe using list of lists 
- Creating a Pandas dataframe using list of tuples 
- Create a Pandas DataFrame from List of Dicts 
- Python | Convert list of nested dictionary into Pandas dataframe 
- Replace values in Pandas dataframe using regex 
- Creating a dataframe from Pandas series 
- Construct a DataFrame in Pandas using string data 
- Clean the string data in the given Pandas Dataframe 
- Reindexing in Pandas DataFrame 
- Mapping external values to dataframe values in Pandas 
- Reshape a pandas DataFrame using stack, unstack and melt method 
- Reset Index in Pandas Dataframe 
- Python | Change column names and row indexes in Pandas DataFrame 


# Pandas  - Assignment no 03
- Here is the link to [Pandas - Assignment no 03]()