# Introduction to Pandas DataFrame

pandas **DataFrames** are data structures that contain:

-   **Data** organized in **two dimensions**, rows and columns
-   **Labels** that correspond to the **rows** and **columns**

DataFrames are similar to SQL tables or the spreadsheets that we work with in Excel or Calc.

Imagine we’re using pandas to analyze data about job candidates for a position developing web applications with Python. Say we’re interested in the candidates’ names, cities, ages, and scores on a Python programming test, or py-score:

<table class="table table-hover">
<thead>
<tr>
<th></th>
<th><code>name</code></th>
<th><code>city</code></th>
<th><code>age</code></th>
<th><code>py-score</code></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong><code>101</code></strong></td>
<td><code>Xavier</code></td>
<td><code>Mexico City</code></td>
<td><code>41</code></td>
<td><code>88.0</code></td>
</tr>
<tr>
<td><strong><code>102</code></strong></td>
<td><code>Ann</code></td>
<td><code>Toronto</code></td>
<td><code>28</code></td>
<td><code>79.0</code></td>
</tr>
<tr>
<td><strong><code>103</code></strong></td>
<td><code>Jana</code></td>
<td><code>Prague</code></td>
<td><code>33</code></td>
<td><code>81.0</code></td>
</tr>
<tr>
<td><strong><code>104</code></strong></td>
<td><code>Yi</code></td>
<td><code>Shanghai</code></td>
<td><code>34</code></td>
<td><code>80.0</code></td>
</tr>
<tr>
<td><strong><code>105</code></strong></td>
<td><code>Robin</code></td>
<td><code>Manchester</code></td>
<td><code>38</code></td>
<td><code>68.0</code></td>
</tr>
<tr>
<td><strong><code>106</code></strong></td>
<td><code>Amal</code></td>
<td><code>Cairo</code></td>
<td><code>31</code></td>
<td><code>61.0</code></td>
</tr>
<tr>
<td><strong><code>107</code></strong></td>
<td><code>Nori</code></td>
<td><code>Osaka</code></td>
<td><code>37</code></td>
<td><code>84.0</code></td>
</tr>
</tbody>
</table>

In this table, the first row contains the **column labels** (name, city, age, and py-score). The first column holds the **row labels** (101, 102, and so on). All other cells are filled with the **data values**.

We usually create a **pandas DataFrame** by passing **data** and **labels** to the ``DataFrame`` constructor.

### Creating a pandas DataFrame With Dictionaries

The **data** can be provided as a list, tuple, NumPy array, dictionary, or pandas Series.

To create a **DataFrame** for the candidate table, we pass the **data** as a dictionary, as shown below:

```python
data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
 }

row_labels = [101, 102, 103, 104, 105, 106, 107] 
 ```

The keys of ``data`` are the labels of the **columns**:

-   ``'name'``
-   ``'city'``
-   ``'age'``
-   ``'py-score'``

**row_labels** refers to a list that contains the labels of the rows, which are numbers ranging from 101 to 107.






In [3]:
import pandas as pd

data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

row_labels = [101, 102, 103, 104, 105, 106, 107]

# Create the DataFrame from the dictionary, using row_labels as the index

df = pd.DataFrame(data=data, index=row_labels)

print(df)



       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0
103    Jana       Prague   33      81.0
104      Yi     Shanghai   34      80.0
105   Robin   Manchester   38      68.0
106    Amal        Cairo   31      61.0
107    Nori        Osaka   37      84.0


That’s it! ``df`` is a variable that holds the reference to our pandas DataFrame. This pandas DataFrame looks just like the candidate table above and has the following features:

-   **Row labels** from 101 to 107
-   **Column labels** such as ``'name'``, ``'city'``, ``'age'``, and ``'py-score'``
-   **Data** such as candidate names, cities, ages, and Python test scores

This figure shows the labels and data from df:

![image.png](attachment:image.png)

### head() and tail()

pandas DataFrames can sometimes be very large, making it impractical to look at all the rows at once. 

We can use ``.head()`` to show the first few rows and ``.tail()`` to show the last few rows.

By default, ``.head()`` shows the first five rows and ``.tail()`` shows the last five rows.

In [10]:

print('First five rows')
print(df.head())

print('***************************')
print('Last five rows')
print('****************************')

print(df.tail())

First five rows
       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0
103    Jana       Prague   33      81.0
104      Yi     Shanghai   34      80.0
105   Robin   Manchester   38      68.0
***************************
Last five rows
****************************
      name        city  age  py-score
103   Jana      Prague   33      81.0
104     Yi    Shanghai   34      80.0
105  Robin  Manchester   38      68.0
106   Amal       Cairo   31      61.0
107   Nori       Osaka   37      84.0


Both methods accept an optional argument ``n`` that sets how many rows to show:

```python
df.head(n) # show first n rows
df.tail(n) # show last n rows
```

In [11]:

print('****** Print first 2 rows ********')
print(df.head(2))

print('******* Print last 2 rows *******')
print(df.tail(2))

****** Print first 2 rows ********
       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0
******* Print last 2 rows *******
     name   city  age  py-score
106  Amal  Cairo   31      61.0
107  Nori  Osaka   37      84.0
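
The argument ``n`` does not have to be a small positive number. The sketch below uses a small made-up frame (``df_demo`` is a hypothetical name, not part of the candidate example) to show what happens with a negative ``n`` and with an ``n`` larger than the number of rows:

```python
import pandas as pd

# Hypothetical five-row frame, used only for this illustration
df_demo = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

# A negative n means "everything except |n| rows at the other end"
all_but_last_two = df_demo.head(-2)    # drops the last two rows
all_but_first_two = df_demo.tail(-2)   # drops the first two rows

# Asking for more rows than exist simply returns the whole frame
everything = df_demo.head(100)

print(all_but_last_two)
print(all_but_first_two)
print(len(everything))
```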


>**Note:** It may be helpful to think of the pandas DataFrame as a dictionary of columns, or pandas Series.
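
To make the dictionary analogy concrete, here is a minimal sketch on a small made-up frame (``df_demo`` is a hypothetical name). It shows that column labels behave like dictionary keys and that each value is a Series:

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration
df_demo = pd.DataFrame({'name': ['Ann', 'Yi'], 'age': [28, 34]}, index=[102, 104])

# Like a dictionary, a DataFrame exposes its column labels as keys
print(list(df_demo.keys()))

# Membership tests look at the column labels, not at the row labels
print('age' in df_demo)
print(102 in df_demo)

# Iterating over .items() yields (column label, Series) pairs
for label, column in df_demo.items():
    print(label, type(column).__name__, column.tolist())
```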

To access a column in a pandas DataFrame:

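**Syntax:** ``df['column_name']`` or ``df.column_name``

The attribute form works only when the column name is a valid Python identifier that doesn't clash with an existing DataFrame attribute, so a column such as ``py-score`` must be accessed as ``df['py-score']``.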


In [12]:
df['city']

101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

In [13]:
df.city

101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

It’s important to notice that we’ve extracted both the data and the corresponding row labels:

![image.png](attachment:image.png)

Each column of a pandas DataFrame is an instance of ``pandas.Series``, a structure that holds one-dimensional data and their labels.

We can get a single item of a Series object the same way we would with a dictionary, by using its label as a key.

In [15]:
cities_series = df.city

# print the value of the city Series at label 102
print(cities_series[102])

Toronto


In this case, 'Toronto' is the data value and 102 is the corresponding label. 

We can also access a whole row with the accessor ``.loc[]``, passing the row label.

In [17]:
df.loc[103]

name          Jana
city        Prague
age             33
py-score      81.0
Name: 103, dtype: object

This time, we’ve extracted the row that corresponds to the label 103, which contains the data for the candidate named Jana. 

In addition to the data values from this row, we’ve extracted the labels of the corresponding columns.

![image.png](attachment:image.png)

The returned row is also an instance of ``pandas.Series``.
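
The difference between a column Series and a row Series lies in what labels them. The following sketch, on a small made-up frame (``df_demo`` is a hypothetical name), compares the two:

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration
df_demo = pd.DataFrame(
    {'name': ['Jana', 'Yi'], 'age': [33, 34], 'py-score': [81.0, 80.0]},
    index=[103, 104],
)

column = df_demo['age']     # a column: labelled by the row labels
row = df_demo.loc[103]      # a row: labelled by the column labels

print(type(column).__name__, list(column.index), column.name)
print(type(row).__name__, list(row.index), row.name)

# A column holds a single data type; a row that mixes strings and numbers
# falls back to the generic object dtype
print(column.dtype, row.dtype)
```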

In [20]:
# Example 1: create a pandas DataFrame from a dictionary whose values are a list, a NumPy array, and a tuple

import numpy as np

d = {'x': [1, 2, 3], 'y': np.array([2, 4, 8]), 'z': (100,200,300)}

df = pd.DataFrame(d)

print(df)

   x  y    z
0  1  2  100
1  2  4  200
2  3  8  300


It’s possible to control the order of the columns with the ``columns`` parameter and the row labels with ``index``:

In [23]:
df = pd.DataFrame(d, index=[100, 200, 300], columns=['z', 'y', 'x'])
print(df)

       z  y  x
100  100  2  1
200  200  4  2
300  300  8  3
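
With dictionary data, ``columns`` does more than reorder: it also selects which keys to use. The sketch below assumes a hypothetical extra label ``'w'`` that is not a key of the dictionary, to show how pandas handles it:

```python
import pandas as pd

d = {'x': [1, 2, 3], 'y': [2, 4, 8], 'z': [100, 200, 300]}

# Only 'z' and 'x' are taken from the dictionary; 'w' is not a key,
# so pandas creates it as a column of missing values (NaN)
df_subset = pd.DataFrame(d, columns=['z', 'x', 'w'])

print(df_subset)
```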


### Creating a pandas DataFrame With Lists

-   **Create a DataFrame using a list of dictionaries**

    Here, the dictionary keys are the column labels, and the dictionary values are the data values in the DataFrame.

-   **Create a DataFrame using a list of lists**

    Here, it is wise to explicitly specify the labels of columns, rows, or both when creating the DataFrame, because a nested list carries no labels of its own.

In [26]:
print('Data frame using list of dictionary')

l1 = [{'x': 1, 'y': 2, 'z': 100},
      {'x': 2, 'y': 4, 'z': 100},
      {'x': 3, 'y': 8, 'z': 100}]

df = pd.DataFrame(l1)

print(df)

l2 = [[1,2,100],
      [2,4,100],
      [3,8,100]]

print('Data frame using nested list')

df = pd.DataFrame(l2,index=['A','B','C'],columns=['X','Y','Z'])

print(df)




Data frame using list of dictionary
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100
Data frame using nested list
   X  Y    Z
A  1  2  100
B  2  4  100
C  3  8  100
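
The dictionaries in the list do not need to share the same keys. As a minimal sketch (the ``records`` list is made up for this illustration), missing keys end up as missing values:

```python
import pandas as pd

# Hypothetical records with different sets of keys
records = [{'x': 1, 'y': 2}, {'x': 2}, {'x': 3, 'y': 8, 'z': 100}]

df_records = pd.DataFrame(records)

# The columns are the union of all keys; absent entries become NaN
print(df_records)
print(df_records.isna())
```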


### Creating a pandas DataFrame With NumPy Arrays

We can pass a two-dimensional NumPy array to the DataFrame constructor the same way we do with a list:

In [30]:
import numpy as np

l = np.array([[1,2,3],[17,19,18],[11,0,-1]])

df = pd.DataFrame(l,columns=['A','B','C'])

print(df)


    A   B   C
0   1   2   3
1  17  19  18
2  11   0  -1


This looks almost the same as the nested list implementation above, but it has one advantage: we can specify the optional parameter ``copy``.

When ``copy`` is ``False``, the data from the NumPy array isn’t copied, so the DataFrame works directly on the array's original data. Older pandas versions behave this way by default for NumPy arrays; with Copy-on-Write enabled (the default from pandas 3.0), the constructor copies the array unless we pass ``copy=False`` explicitly.

When the data isn't copied, modifying the array changes our DataFrame too:

In [32]:
l[0][0] = 1000
print(df)

      A   B   C
0  1000   2   3
1    17  19  18
2    11   0  -1


To create a DataFrame from a copy of the values of the NumPy array ``l``, we should specify ``copy=True``. This way, ``df`` won't change if we change ``l``.

In [35]:
df = pd.DataFrame(l,columns=['A','B','C'],copy=True)

print(df)

l[0][0] = 123

print('Dataframe after modifying array l')

print(df)

      A   B   C
0  1000   2   3
1    17  19  18
2    11   0  -1
Dataframe after modifying array l
      A   B   C
0  1000   2   3
1    17  19  18
2    11   0  -1


### Creating a pandas DataFrame From Files

We can save our job candidate DataFrame to a CSV file with ``.to_csv()`` and load it back with ``pd.read_csv()``:

In [38]:
import pandas as pd
import os

data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

row_labels = [101, 102, 103, 104, 105, 106, 107]

directory = 'resources'

# create the directory if it does not exist yet
os.makedirs(directory, exist_ok=True)

file_path = os.path.join(directory,'candidate.csv')

df = pd.DataFrame(data = data,index=row_labels)

df.to_csv(file_path)

# load DataFrame from csv file

df1 = pd.read_csv(file_path,index_col=0)

print(df1)

       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0
103    Jana       Prague   33      81.0
104      Yi     Shanghai   34      80.0
105   Robin   Manchester   38      68.0
106    Amal        Cairo   31      61.0
107    Nori        Osaka   37      84.0


In this case, ``index_col=0`` specifies that the row labels are located in the first column of the CSV file.
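
To see why ``index_col=0`` matters, the sketch below reads the same CSV text with and without it. It uses an in-memory string instead of a file, and ``df_demo`` is a hypothetical two-row frame made up for this illustration:

```python
import io

import pandas as pd

# Hypothetical two-row frame, used only for this illustration
df_demo = pd.DataFrame({'name': ['Xavier', 'Ann'], 'age': [41, 28]}, index=[101, 102])

# With no path given, .to_csv() returns the CSV content as a string
csv_text = df_demo.to_csv()
print(csv_text)

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is restored as the row labels
print(with_index.index.tolist())

# Without it, the labels show up as an extra data column and
# pandas assigns default row labels 0, 1, ...
print(without_index.columns.tolist())
print(without_index.index.tolist())
```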

### pandas DataFrame Labels as Sequences

We can get the DataFrame’s row labels with ``.index`` and its column labels with ``.columns``.

In [40]:
print(f'Row Labels : {df.index}')
print(f'Columns Labels : {df.columns}')

Row Labels : Index([101, 102, 103, 104, 105, 106, 107], dtype='int64')
Columns Labels : Index(['name', 'city', 'age', 'py-score'], dtype='object')


The row and column labels are special kinds of sequences.

As with any other Python sequence, we can get a single item by position, for example ``df.index[1]`` or ``df.columns[1]``.

In addition to extracting a particular item, we can apply other sequence operations, including iterating through the labels of rows or columns.

In [49]:
print('Row Labels:->')

for row_label in df.index:
    print(row_label, end=' ')

# We can also modify the labels by assigning a new sequence to .index:

row_labels = np.arange(100, 107)

df.index = row_labels

print()
print('New Row Labels:->')

for row_label in df.index:
    print(row_label, end=' ')

print()
print('Dataframe')
print(df)



Row Labels:->
101 102 103 104 105 106 107 
New Row Labels:->
100 101 102 103 104 105 106 
Dataframe
       name         city  age  py-score
100  Xavier  Mexico City   41      88.0
101     Ann      Toronto   28      79.0
102    Jana       Prague   33      81.0
103      Yi     Shanghai   34      80.0
104   Robin   Manchester   38      68.0
105    Amal        Cairo   31      61.0
106    Nori        Osaka   37      84.0
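
The other sequence operations on labels can be sketched as follows. ``df_demo`` is a hypothetical three-row frame made up for this illustration; the last part shows that single labels cannot be reassigned in place, which is why the whole ``.index`` is replaced instead:

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration
df_demo = pd.DataFrame(
    {'name': ['Xavier', 'Ann', 'Jana'], 'age': [41, 28, 33]},
    index=[101, 102, 103],
)

second_row_label = df_demo.index[1]         # indexing by position
second_column_label = df_demo.columns[1]
n_rows = len(df_demo.index)                 # length
has_label = 103 in df_demo.index            # membership test
column_list = list(df_demo.columns)         # conversion to a plain list

print(second_row_label, second_column_label, n_rows, has_label, column_list)

# Index objects are immutable, so item assignment raises TypeError
immutable = False
try:
    df_demo.index[0] = 999
except TypeError:
    immutable = True

print(immutable)
```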


### DataFrame as NumPy Arrays

To extract the data from a pandas DataFrame without its labels, as an unlabeled NumPy array, we can use either ``.to_numpy()`` or ``.values``.

Both ``.to_numpy()`` and ``.values`` work similarly, and they both return a NumPy array with the data from the pandas DataFrame:

![image.png](attachment:image.png)

The pandas documentation suggests using ``.to_numpy()`` because of the flexibility offered by two optional parameters:

1. **dtype:** Use this parameter to specify the data type of the resulting array. It’s set to ``None`` by default.

2. **copy:** Set this parameter to ``True`` to make sure the returned array is a copy of the data. With the default, ``False``, pandas may return the original data without copying, although a copy cannot always be avoided (for example, when the columns have different data types).

However, ``.values`` has been around for much longer than ``.to_numpy()``, so we will still see it often in older code.

In [54]:
print('Extract DataFrame Data as Numpy Array using .to_numpy method')

print(df.to_numpy())

print('Extract DataFrame Data as Numpy Array using .values property')

print(df.values)

Extract DataFrame Data as Numpy Array using .to_numpy method
[['Xavier' 'Mexico City' 41 88.0]
 ['Ann' 'Toronto' 28 79.0]
 ['Jana' 'Prague' 33 81.0]
 ['Yi' 'Shanghai' 34 80.0]
 ['Robin' 'Manchester' 38 68.0]
 ['Amal' 'Cairo' 31 61.0]
 ['Nori' 'Osaka' 37 84.0]]
Extract DataFrame Data as Numpy Array using .values property
[['Xavier' 'Mexico City' 41 88.0]
 ['Ann' 'Toronto' 28 79.0]
 ['Jana' 'Prague' 33 81.0]
 ['Yi' 'Shanghai' 34 80.0]
 ['Robin' 'Manchester' 38 68.0]
 ['Amal' 'Cairo' 31 61.0]
 ['Nori' 'Osaka' 37 84.0]]
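
The two optional parameters of ``.to_numpy()`` can be sketched on a numeric-only frame. ``df_demo`` is a hypothetical name, and the choice of ``np.float32`` is only an example:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame, used only for this illustration
df_demo = pd.DataFrame({'age': [41, 28], 'py-score': [88.0, 79.0]}, index=[101, 102])

# dtype: request a specific data type for the resulting array
arr = df_demo.to_numpy(dtype=np.float32)
print(arr.dtype, arr.shape)

# copy=True: the array is independent of the DataFrame,
# so changing the array leaves the DataFrame untouched
arr_copy = df_demo.to_numpy(copy=True)
arr_copy[0, 0] = -1

print(arr_copy)
print(df_demo)
```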


### Data Types

We can get the data types for each column of a pandas DataFrame with ``.dtypes``.

``.dtypes`` returns a Series object with the column names as labels and the corresponding data types as values.

To modify the data type of one or more columns, we can use ``.astype()``.



In [59]:
print('Fetching Data Types of DataFrame')

print(df.dtypes)

print('Modifying the Data Types Of Columns')

df = df.astype(dtype={'age': np.int32, 'py-score': np.float32})

print(df.dtypes)

Fetching Data Types of DataFrame
name         object
city         object
age           int64
py-score    float64
dtype: object
Modifying the Data Types Of Columns
name         object
city         object
age           int32
py-score    float32
dtype: object


>**Note:** The most important and only mandatory parameter of ``.astype()`` is ``dtype``. It expects a data type or a dictionary.
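
The two forms of ``dtype`` can be sketched as follows, on a hypothetical numeric frame ``df_demo`` made up for this illustration. Note that ``.astype()`` returns a new object rather than changing the original:

```python
import pandas as pd

# Hypothetical numeric frame, used only for this illustration
df_demo = pd.DataFrame({'age': [41, 28], 'py-score': [88.0, 79.0]})

# A single data type is applied to every column
as_float = df_demo.astype('float32')
print(as_float.dtypes)

# A dictionary converts only the listed columns
only_age = df_demo.astype({'age': 'int32'})
print(only_age.dtypes)

# The original DataFrame keeps its data types
print(df_demo.dtypes)
```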


### pandas DataFrame Size (.ndim,.shape and .size)

The attributes ``.ndim``, ``.shape``, and ``.size`` return the number of dimensions, number of data values across each dimension, and total number of data values, respectively:

-   ``.ndim`` returns the number of dimensions, which is 2 for a DataFrame instance. For a Series object, ``.ndim`` is 1.

-   ``.shape`` returns a tuple with the number of rows and the number of columns.

-   ``.size`` returns an integer equal to the number of values in the DataFrame.

In [60]:
print(f'Dimension of DataFrame Instance df is : {df.ndim}')

print(f'Shape of DataFrame Instance df is : {df.shape}')

print(f'Size of DataFrame Instance df is : {df.size}')

Dimension of DataFrame Instance df is : 2
Shape of DataFrame Instance df is : (7, 4)
Size of DataFrame Instance df is : 28
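
The three attributes are related: ``.size`` is the product of the entries in ``.shape``. A minimal sketch on a hypothetical frame ``df_demo``, which also shows the same attributes on a single column:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns, used only for this illustration
df_demo = pd.DataFrame({'name': ['Xavier', 'Ann', 'Jana'], 'age': [41, 28, 33]})

rows, cols = df_demo.shape
print(df_demo.ndim, df_demo.shape, df_demo.size)
print(df_demo.size == rows * cols)

# len() of a DataFrame is its number of rows
print(len(df_demo) == rows)

# A single column is a Series: one dimension, one-element shape tuple
column = df_demo['age']
print(column.ndim, column.shape, column.size)
```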


### Getting Data With Accessors

pandas has four accessors in total:

-   ``.loc[]`` accepts the labels of rows and columns and returns Series or DataFrames. It can be used for getting entire rows or columns, as well as their parts.

-   ``.iloc[]`` accepts the zero-based indices of rows and columns and returns Series or DataFrames. It can be used for getting entire rows or columns, or their parts.

-   ``.at[]`` accepts the labels of rows and columns and returns a single data value.

-   ``.iat[]`` accepts the zero-based indices of rows and columns and returns a single data value.

In [71]:
print('Fetch All Cities using column name - city')

print(df.loc[:,'city'])

print('Fetch all data corresponding row label - 100')

print(df.loc[100])

print('Fetch All Cities using index of city column')

print(df.iloc[:,1])

print('Fetch all data corresponding row index - 0')

print(df.iloc[0])

print('Fetching value at row_label 100 and column city')

print(df.at[100,'city'])

print('Fetching value at row index 0 and column 1')

print(df.iat[0,1])




Fetch All Cities using column name - city
100    Mexico City
101        Toronto
102         Prague
103       Shanghai
104     Manchester
105          Cairo
106          Osaka
Name: city, dtype: object
Fetch all data corresponding row label - 100
name             Xavier
city        Mexico City
age                  41
py-score           88.0
Name: 100, dtype: object
Fetch All Cities using index of city column
100    Mexico City
101        Toronto
102         Prague
103       Shanghai
104     Manchester
105          Cairo
106          Osaka
Name: city, dtype: object
Fetch all data corresponding row index - 0
name             Xavier
city        Mexico City
age                  41
py-score           88.0
Name: 100, dtype: object
Fetching value at row_label 100 and column city
Mexico City
Fetching value at row index 0 and column 1
Mexico City


The slice construct (``:``) in the row label position means that all the rows should be included.

``df.loc[:, 'city']`` returns the column city and ``df.iloc[:, 1]`` returns the same column because the zero-based index 1 refers to the second column, city.

In [68]:
print('Fetch all rows from row_label 100 to 103(inclusive) for column name and city')

print(df.loc[100:103,['name','city']])

print('Fetch all rows from index 0 to 4(exclusive) for column index 0 and 1')

print(df.iloc[0:4,[0,1]])



Fetch all rows from row_label 100 to 103(inclusive) for column name and city
       name         city
100  Xavier  Mexico City
101     Ann      Toronto
102    Jana       Prague
103      Yi     Shanghai
Fetch all rows from index 0 to 4(exclusive) for column index 0 and 1
       name         city
100  Xavier  Mexico City
101     Ann      Toronto
102    Jana       Prague
103      Yi     Shanghai


With ``.loc[]``, both the start and stop labels of a slice are inclusive, meaning they are included in the returned values.

With ``.iloc[]``, however, the stop index of a slice is exclusive, meaning it is excluded from the returned values.

We can skip rows and columns with ``.iloc[]`` the same way we can when slicing tuples, lists, and NumPy arrays:

In [69]:
print('Fetch all rows from index 1 to 5 -skipping every second row for column index - 0')

print(df.iloc[1:6:2, 0])

Fetch all rows from index 1 to 5 -skipping every second row for column index - 0
101     Ann
103      Yi
105    Amal
Name: name, dtype: object
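
Since ``.iloc[]`` follows the usual Python slicing rules, negative positions and negative steps work as well. A minimal sketch on a hypothetical five-row frame ``df_demo``:

```python
import pandas as pd

# Hypothetical five-row frame, used only for this illustration
df_demo = pd.DataFrame(
    {'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin']},
    index=[100, 101, 102, 103, 104],
)

last_row = df_demo.iloc[-1]          # negative positions count from the end
last_two = df_demo.iloc[-2:]         # the last two rows
reversed_rows = df_demo.iloc[::-1]   # all rows in reverse order
every_other = df_demo.iloc[::2]      # every second row, starting with the first

print(last_row['name'])
print(last_two.index.tolist())
print(reversed_rows.index.tolist())
print(every_other.index.tolist())
```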


### Setting Data With Accessors

We can use accessors to modify parts of a pandas DataFrame by passing a Python sequence, NumPy array, or single value.

In [None]:
print('Display rows from row_label 100 to 104 for column py-score')

print(df.loc[:104,'py-score'])

print('Modifying rows from row_label 100 to 104 for column py-score')

df.loc[:104,'py-score'] = [40,50,70,90,100] # modifies the first five items

print('Display rows from row_label 100 to 104 for column py-score')

print(df.loc[:104,'py-score'])

df.loc[105:,'py-score'] = 0 # sets the remaining values in this column to 0

print('Display all rows for column py-score')

print(df.loc[:,'py-score'])

df.iloc[:,-1] = np.array([88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]) # restores the original scores

print('Display all rows for column py-score')

print(df.loc[:,'py-score'])


Display rows from row_label 100 to 104 for column py-score
100    88.0
101    79.0
102    81.0
103    80.0
104    68.0
Name: py-score, dtype: float32
Modifying rows from row_label 100 to 104 for column py-score
Display rows from row_label 100 to 104 for column py-score
100     40.0
101     50.0
102     70.0
103     90.0
104    100.0
Name: py-score, dtype: float32
Display all rows for column py-score
100     40.0
101     50.0
102     70.0
103     90.0
104    100.0
105      0.0
106      0.0
Name: py-score, dtype: float32
Display all rows for column py-score
100    88.0
101    79.0
102    81.0
103    80.0
104    68.0
105    61.0
106    84.0
Name: py-score, dtype: float32
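
The single-value accessors ``.at[]`` and ``.iat[]`` can be used for setting data too. A minimal sketch on a hypothetical three-row frame ``df_demo``, with made-up replacement scores:

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration
df_demo = pd.DataFrame(
    {'name': ['Xavier', 'Ann', 'Jana'], 'py-score': [88.0, 79.0, 81.0]},
    index=[100, 101, 102],
)

# .at[] sets one value by row label and column label
df_demo.at[101, 'py-score'] = 99.0

# .iat[] sets one value by row position and column position
df_demo.iat[2, 1] = 50.0

print(df_demo)
```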
