# Creating and Operating on pandas Dataframes

Suppose we need to create a DataFrame for a retail chain to record the following information:
![image-6.png](attachment:image-6.png)

We may construct the DataFrame from a dictionary.

In [1]:
# First construct the dictionary
stores_dict = {'manager': ['Ali', 'MAdhura', 'Victoria', 'Mary', 'Nina', 'Mia'],
               'city': ['Hamilton', 'Toronto', 'Burlington', 'Brantford', 'Niagara', 'Oakvill'],
               'employess': [100, 20, 30, 58, 150, 167]}
stores_dict

{'manager': ['Ali', 'MAdhura', 'Victoria', 'Mary', 'Nina', 'Mia'],
 'city': ['Hamilton',
  'Toronto',
  'Burlington',
  'Brantford',
  'Niagara',
  'Oakvill'],
 'employess': [100, 20, 30, 58, 150, 167]}

In [2]:
import pandas as pd
# Then, construct the DataFrame
stores = pd.DataFrame(stores_dict)
stores

Unnamed: 0,manager,city,employess
0,Ali,Hamilton,100
1,MAdhura,Toronto,20
2,Victoria,Burlington,30
3,Mary,Brantford,58
4,Nina,Niagara,150
5,Mia,Oakvill,167



Notice how we first created a dictionary named `stores_dict`, which was then passed to the `DataFrame()` function to create a pandas DataFrame named `stores`. (**Note**: pay attention to the upper and lower case letters in the `DataFrame()` function.)
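The dictionary above describes the data column by column. As a minimal sketch, the same table can also be described row by row, as a list of dictionaries with one dictionary per store. The names `by_column` and `by_row` and the three-store subset are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative three-store subset of the data above, written column by column
by_column = pd.DataFrame({
    'manager': ['Ali', 'MAdhura', 'Victoria'],
    'city': ['Hamilton', 'Toronto', 'Burlington'],
    'employess': [100, 20, 30],
})

# The same data written row by row: one dictionary per store
by_row = pd.DataFrame([
    {'manager': 'Ali', 'city': 'Hamilton', 'employess': 100},
    {'manager': 'MAdhura', 'city': 'Toronto', 'employess': 20},
    {'manager': 'Victoria', 'city': 'Burlington', 'employess': 30},
])

# Both constructions give the same columns, values, and row indexes
same = by_column.equals(by_row)
print(same)
```

The column-wise form is convenient when the data already lives in separate lists; the row-wise form is convenient when records arrive one at a time.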

# Indexing, Slicing, and Filtering

## Column Names and Row Indexes

We can extract the column (variable) names:

In [3]:
# Get the column names
stores.columns

Index(['manager', 'city', 'employess'], dtype='object')

In [4]:
# Accessing a column name
stores.columns[0]

'manager'

We can also get the indexes of the rows:

In [5]:
# Get the indexes of all rows
stores.index

RangeIndex(start=0, stop=6, step=1)

In [6]:
# Show it as a list
list(stores.index)

[0, 1, 2, 3, 4, 5]

## Selecting elements of Dataframes

### Selecting rows

Rows (observations) of a Dataframe can be selected using the `iloc` property (short for **integer location**).

For example, suppose we want to extract the second row.


In [7]:
stores.iloc[1]

manager      MAdhura
city         Toronto
employess         20
Name: 1, dtype: object

Extracting **multiple** rows:

* First, let's extract a range of rows; e.g., the second to the fourth row (rows with indexes 1 to 3). 

* Second, let's extract the first, third, and fifth rows (indexes 0, 2, 4), which do not form a contiguous range.

In [8]:
# First task
stores.iloc[1:4]

Unnamed: 0,manager,city,employess
1,MAdhura,Toronto,20
2,Victoria,Burlington,30
3,Mary,Brantford,58


In [9]:
# Second task
stores.iloc[[0,2,4]]

Unnamed: 0,manager,city,employess
0,Ali,Hamilton,100
2,Victoria,Burlington,30
4,Nina,Niagara,150


If we want to index certain rows and particular columns, we need to include both row indexes and column indexes in `iloc[]`.

For example, to get 'manager' (column 0) and 'employess' (column 2) from rows 0, 2, and 4:

In [10]:
stores.iloc[[0,2,4],[0,2]]

Unnamed: 0,manager,employess
0,Ali,100
2,Victoria,30
4,Nina,150


A DataFrame also has a `loc` attribute that can be used for indexing.

* Unlike `iloc`, which selects by integer position, `loc` selects by label: row index labels and column names.

In [11]:
# Get the 'manager' and 'employess' columns from rows 0, 2, and 4
stores.loc[[0,2,4],['manager','employess']]

Unnamed: 0,manager,employess
0,Ali,100
2,Victoria,30
4,Nina,150
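In the `stores` DataFrame the row labels happen to equal the row positions, so `iloc` and `loc` look interchangeable for rows. The sketch below uses a small DataFrame with made-up row labels (10, 20, 30, 40, an assumption for illustration only) to show where the two differ, including how each treats the end of a slice.

```python
import pandas as pd

demo = pd.DataFrame(
    {'manager': ['Ali', 'MAdhura', 'Victoria', 'Mary'],
     'employess': [100, 20, 30, 58]},
    index=[10, 20, 30, 40],  # made-up row labels that differ from positions
)

# iloc slices by position: positions 1 and 2 (end position 3 is excluded)
by_position = demo.iloc[1:3]

# loc slices by label: labels 20, 30, and 40 (end label 40 is included)
by_label = demo.loc[20:40]

print(list(by_position.index))  # [20, 30]
print(list(by_label.index))     # [20, 30, 40]
```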


### Selecting columns

We can select a column using `name_of_df[column_label]`, for example:

In [12]:
# Selecting the 'city' column
stores['city']

0      Hamilton
1       Toronto
2    Burlington
3     Brantford
4       Niagara
5       Oakvill
Name: city, dtype: object

In [13]:
# Selecting 'manager' and 'city'
stores[['manager','city']]

Unnamed: 0,manager,city
0,Ali,Hamilton
1,MAdhura,Toronto
2,Victoria,Burlington
3,Mary,Brantford
4,Nina,Niagara
5,Mia,Oakvill


Equivalently, we can use `iloc` or `loc` with all the rows selected:

In [14]:
stores.iloc[:,0]

0         Ali
1     MAdhura
2    Victoria
3        Mary
4        Nina
5         Mia
Name: manager, dtype: object

In [15]:
stores.loc[:,'manager']

0         Ali
1     MAdhura
2    Victoria
3        Mary
4        Nina
5         Mia
Name: manager, dtype: object

We can even use `name_of_df.name_of_column`!

In [16]:
stores.city

0      Hamilton
1       Toronto
2    Burlington
3     Brantford
4       Niagara
5       Oakvill
Name: city, dtype: object
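Attribute-style access is a convenience that does not work for every column. The sketch below, using made-up column names `'store size'` and `'count'`, illustrates two cases where bracket notation is needed: a name that is not a valid Python identifier, and a name that clashes with an existing DataFrame method.

```python
import pandas as pd

demo = pd.DataFrame({
    'city': ['Hamilton', 'Toronto'],
    'store size': [100, 20],  # made-up name containing a space
    'count': [3, 5],          # made-up name that clashes with DataFrame.count
})

cities = demo.city          # works: valid identifier, no clash
sizes = demo['store size']  # a name with a space needs brackets
counts = demo['count']      # brackets return the column

# demo.count is the DataFrame.count method, not the 'count' column
is_method = callable(demo.count)
print(is_method)
```

For this reason, bracket notation is the safer default in code that must work for arbitrary column names.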

We can select more than one column using `df[list_of_column_names]`; the columns are returned in the order listed:

In [17]:
stores[['city','manager']]

Unnamed: 0,city,manager
0,Hamilton,Ali
1,Toronto,MAdhura
2,Burlington,Victoria
3,Brantford,Mary
4,Niagara,Nina
5,Oakvill,Mia


## Adding or deleting columns and rows

Let's add a column of the stores' monthly revenue, say, 341, 280, 300, 260, 213, 182 (in thousands of dollars).

* By default, the column will be added as the last column

In [18]:
# Adding a column of the store revenues: 341, 280, 300, 260, 213, 182
stores['revenue'] = [341, 280, 300, 260, 213, 182]
stores

Unnamed: 0,manager,city,employess,revenue
0,Ali,Hamilton,100,341
1,MAdhura,Toronto,20,280
2,Victoria,Burlington,30,300
3,Mary,Brantford,58,260
4,Nina,Niagara,150,213
5,Mia,Oakvill,167,182


We can also create a new column based on existing ones.

For example, we can define a column of **revenue_per_employee** $=\frac{\rm revenue}{\rm employees}$.

In [19]:
stores['rev_per_emp']=stores['revenue']/stores['employess']
stores

Unnamed: 0,manager,city,employess,revenue,rev_per_emp
0,Ali,Hamilton,100,341,3.41
1,MAdhura,Toronto,20,280,14.0
2,Victoria,Burlington,30,300,10.0
3,Mary,Brantford,58,260,4.482759
4,Nina,Niagara,150,213,1.42
5,Mia,Oakvill,167,182,1.08982


In the above DataFrame, each value of the variable 'rev_per_emp' is equal to the value of 'revenue' divided by the value of 'employess' in the same row.
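As a hedged alternative to assigning with `stores[...] = ...`, the `assign()` method builds the derived column on a new DataFrame and leaves the original untouched. The small `demo` DataFrame below is an illustrative subset of the data above.

```python
import pandas as pd

demo = pd.DataFrame({'employess': [100, 20, 30],
                     'revenue': [341, 280, 300]})

# assign() returns a new DataFrame with the extra column; demo is unchanged
with_ratio = demo.assign(rev_per_emp=demo['revenue'] / demo['employess'])

print(list(demo.columns))        # ['employess', 'revenue']
print(list(with_ratio.columns))  # ['employess', 'revenue', 'rev_per_emp']
```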

To drop a row, say row 4, we can use the `drop()` method and specify the index of the row to drop:

In [20]:
stores.drop(4,inplace=True)
stores

Unnamed: 0,manager,city,employess,revenue,rev_per_emp
0,Ali,Hamilton,100,341,3.41
1,MAdhura,Toronto,20,280,14.0
2,Victoria,Burlington,30,300,10.0
3,Mary,Brantford,58,260,4.482759
5,Mia,Oakvill,167,182,1.08982


In [21]:
stores

Unnamed: 0,manager,city,employess,revenue,rev_per_emp
0,Ali,Hamilton,100,341,3.41
1,MAdhura,Toronto,20,280,14.0
2,Victoria,Burlington,30,300,10.0
3,Mary,Brantford,58,260,4.482759
5,Mia,Oakvill,167,182,1.08982


**Note**: Observe that we have used the argument `inplace = True`. Without this argument, `drop()` leaves the DataFrame itself unchanged and returns a copy of it without the deleted row.

By adding `inplace = True`, we indicate we want to modify the DataFrame in place.
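The difference can be checked on a small example. The sketch below (with an illustrative three-row `demo` DataFrame) calls `drop()` without `inplace=True` and then shows the common alternative of reassigning the result to the same name.

```python
import pandas as pd

demo = pd.DataFrame({'manager': ['Ali', 'MAdhura', 'Victoria'],
                     'employess': [100, 20, 30]})

# Without inplace=True, drop() returns a new DataFrame and demo keeps all rows
without_row_1 = demo.drop(1)
print(len(demo), len(without_row_1))  # 3 2

# Reassigning the result updates the name demo, much like inplace=True would
demo = demo.drop(1)
print(list(demo.index))  # [0, 2]
```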

`drop()` deletes rows by default.

To delete a column, we can specify the argument `axis=1`:

In [22]:
stores.drop('rev_per_emp',axis=1,inplace=True)
stores

Unnamed: 0,manager,city,employess,revenue
0,Ali,Hamilton,100,341
1,MAdhura,Toronto,20,280
2,Victoria,Burlington,30,300
3,Mary,Brantford,58,260
5,Mia,Oakvill,167,182


**Note**: Notice that we have again included the argument `inplace = True`, indicating that we want the DataFrame itself to be modified.
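`drop()` also accepts a `columns` keyword, which some readers find clearer than `axis=1`. The sketch below, on an illustrative `demo` DataFrame, shows that the two spellings give the same result and that a list drops several columns at once.

```python
import pandas as pd

demo = pd.DataFrame({'manager': ['Ali', 'MAdhura'],
                     'revenue': [341, 280],
                     'rev_per_emp': [3.41, 14.0]})

via_axis = demo.drop('rev_per_emp', axis=1)     # the form used above
via_keyword = demo.drop(columns='rev_per_emp')  # equivalent keyword form

# A list of names drops several columns in one call
several = demo.drop(columns=['revenue', 'rev_per_emp'])

print(via_axis.equals(via_keyword))
print(list(several.columns))
```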

We can add a row by assigning a new row (as a list) to a row index that does not yet exist in the DataFrame:

In [23]:
stores.loc[6] = ['John','Brampton',12,99]
stores

Unnamed: 0,manager,city,employess,revenue
0,Ali,Hamilton,100,341
1,MAdhura,Toronto,20,280
2,Victoria,Burlington,30,300
3,Mary,Brantford,58,260
5,Mia,Oakvill,167,182
6,John,Brampton,12,99
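Assigning through `loc` is handy for a single row. When the new rows are already available as records, one possible alternative is `pd.concat()`, sketched below on an illustrative two-store DataFrame; the names `demo`, `new_row`, and `extended` are assumptions for this example.

```python
import pandas as pd

demo = pd.DataFrame({'manager': ['Ali', 'MAdhura'],
                     'city': ['Hamilton', 'Toronto'],
                     'employess': [100, 20],
                     'revenue': [341, 280]})

# The new store as a one-row DataFrame with matching column names
new_row = pd.DataFrame([{'manager': 'John', 'city': 'Brampton',
                         'employess': 12, 'revenue': 99}])

# ignore_index=True renumbers the rows 0, 1, 2 in the result
extended = pd.concat([demo, new_row], ignore_index=True)
print(list(extended.index))  # [0, 1, 2]
```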


## Resetting the indexes after deletion

The row indexes become discontinuous after dropping a row.

To fix this, we can use `reset_index()` to renumber the rows. The argument `drop=True` discards the old indexes instead of keeping them as a new column.

In [24]:
stores.reset_index(drop=True,inplace=True)
stores

Unnamed: 0,manager,city,employess,revenue
0,Ali,Hamilton,100,341
1,MAdhura,Toronto,20,280
2,Victoria,Burlington,30,300
3,Mary,Brantford,58,260
4,Mia,Oakvill,167,182
5,John,Brampton,12,99
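To see what `drop=True` changes, the sketch below resets the index of a small illustrative DataFrame both ways. Without `drop=True`, the old row labels are kept in a new column named `index`.

```python
import pandas as pd

# Three rows with the middle one dropped, leaving row labels 0 and 2
demo = pd.DataFrame({'manager': ['Ali', 'MAdhura', 'Victoria']}).drop(1)

kept = demo.reset_index()              # old labels saved in an 'index' column
dropped = demo.reset_index(drop=True)  # old labels discarded

print(list(kept.columns))     # ['index', 'manager']
print(list(dropped.columns))  # ['manager']
```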


## Filtering a Dataframe

Suppose that we would like to identify the under-staffed stores. 

We can do so by filtering the Dataframe based on a certain condition, by `df[condition]`. 
* The result of the filtering will be all the rows that satisfy the certain condition. 
* For example, let's get the stores that had fewer than 50 employees.

In [25]:
stores_under_staff = stores[stores['employess'] < 50]
stores_under_staff

Unnamed: 0,manager,city,employess,revenue
1,MAdhura,Toronto,20,280
2,Victoria,Burlington,30,300
5,John,Brampton,12,99


We can also use the symbols `&` (AND) and `|` (OR) in filtering a Dataframe.

For example, we can select the stores with fewer than 50 employees and a revenue over 150. Each condition must be wrapped in parentheses.

In [26]:
stores[(stores['employess']<50) & (stores['revenue']>150)]

Unnamed: 0,manager,city,employess,revenue
1,MAdhura,Toronto,20,280
2,Victoria,Burlington,30,300


A condition using OR (using the symbol "|") can be done similarly.
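A minimal sketch of an OR condition, together with negation using `~`, on an illustrative four-store DataFrame. The thresholds are arbitrary choices for this example. The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `<`.

```python
import pandas as pd

demo = pd.DataFrame({'city': ['Hamilton', 'Toronto', 'Burlington', 'Brantford'],
                     'employess': [100, 20, 30, 58],
                     'revenue': [341, 280, 300, 260]})

# OR: fewer than 50 employees, or revenue above 300 (or both)
either = demo[(demo['employess'] < 50) | (demo['revenue'] > 300)]

# NOT: stores that do not have fewer than 50 employees
not_small = demo[~(demo['employess'] < 50)]

print(list(either.index))     # [0, 1, 2]
print(list(not_small.index))  # [0, 3]
```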

## Modifying the values of some entries

Suppose that a person named Ray is the new manager of the Toronto store (row 1), and we need to update the record.


In [27]:
# Update the manager in row 1, then view the DataFrame again
stores.loc[1,'manager'] = 'Ray'
stores

Unnamed: 0,manager,city,employess,revenue
0,Ali,Hamilton,100,341
1,Ray,Toronto,20,280
2,Victoria,Burlington,30,300
3,Mary,Brantford,58,260
4,Mia,Oakvill,167,182
5,John,Brampton,12,99
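The update above relies on knowing the row index. When only the city is known, one possible approach is to pass a condition to `loc` in place of the row label, as in this sketch on an illustrative three-store DataFrame.

```python
import pandas as pd

demo = pd.DataFrame({'manager': ['Ali', 'MAdhura', 'Victoria'],
                     'city': ['Hamilton', 'Toronto', 'Burlington']})

# Select rows by condition and the column by name, then assign the new value
demo.loc[demo['city'] == 'Toronto', 'manager'] = 'Ray'

print(demo['manager'].tolist())  # ['Ali', 'Ray', 'Victoria']
```

Every row that satisfies the condition is updated, so the condition should identify exactly the rows intended.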


## Converting a Dataframe (or part of it) to an array

Sometimes we need to convert parts of a DataFrame to an array.
* This may be useful in machine learning, because some algorithms work with NumPy arrays and require us to convert the relevant part of the DataFrame to an array before operating on it.

In such cases, we may use the `values` attribute of a DataFrame.
* Using `values` will return an array without the headers

In [28]:
# Convert the whole DataFrame to an array
stores_array = stores.values
stores_array

array([['Ali', 'Hamilton', 100, 341],
       ['Ray', 'Toronto', 20, 280],
       ['Victoria', 'Burlington', 30, 300],
       ['Mary', 'Brantford', 58, 260],
       ['Mia', 'Oakvill', 167, 182],
       ['John', 'Brampton', 12, 99]], dtype=object)

In [29]:
# Convert part of the DataFrame into an array (a loc slice includes the end label, so rows 0 to 4)
stores.loc[0:4,['employess','revenue']].values

array([[100, 341],
       [ 20, 280],
       [ 30, 300],
       [ 58, 260],
       [167, 182]])
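The pandas documentation recommends the `to_numpy()` method over the `values` attribute for new code. The sketch below applies it to an illustrative two-store DataFrame and also shows why the element type differs between the two outputs above: mixing text and numbers forces the generic `object` type, while numeric columns alone give an integer array.

```python
import pandas as pd

demo = pd.DataFrame({'manager': ['Ali', 'Ray'],
                     'employess': [100, 20],
                     'revenue': [341, 280]})

# Text and numbers together: the array falls back to dtype object
mixed = demo.to_numpy()

# Numeric columns only: the array keeps an integer dtype
numeric = demo[['employess', 'revenue']].to_numpy()

print(mixed.dtype)
print(numeric.shape)  # (2, 2)
```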

We can convert an array back to a DataFrame by supplying the column names:

In [30]:
pd.DataFrame(stores_array,columns=['names','city','employees','revenue'])

Unnamed: 0,names,city,employees,revenue
0,Ali,Hamilton,100,341
1,Ray,Toronto,20,280
2,Victoria,Burlington,30,300
3,Mary,Brantford,58,260
4,Mia,Oakvill,167,182
5,John,Brampton,12,99


## Renaming one (or more) columns
Suppose that we would like to change the column name 'city' to 'location', and 'employess' to 'employees'. We may use the `df.rename()` method.

The input argument to this function should be a mapping from the old names to the new names (in the form of a dictionary).

In [31]:
stores=stores.rename(columns={'employess':'employees','city':'location'})
stores

Unnamed: 0,manager,location,employees,revenue
0,Ali,Hamilton,100,341
1,Ray,Toronto,20,280
2,Victoria,Burlington,30,300
3,Mary,Brantford,58,260
4,Mia,Oakvill,167,182
5,John,Brampton,12,99
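Two behaviours of `rename()` are worth checking on a small example: it returns a new DataFrame (which is why the result was assigned back to `stores` above), and by default it silently ignores names that are not in the DataFrame. The sketch below uses an illustrative one-row DataFrame and a made-up key `'no_such_column'`; passing `errors='raise'` turns a missing name into a `KeyError`.

```python
import pandas as pd

demo = pd.DataFrame({'city': ['Hamilton'], 'employess': [100]})

# The unknown key is ignored by default; demo itself is not modified
renamed = demo.rename(columns={'city': 'location', 'no_such_column': 'x'})
print(list(renamed.columns))  # ['location', 'employess']
print(list(demo.columns))     # ['city', 'employess']

# With errors='raise', a missing column name raises KeyError
try:
    demo.rename(columns={'no_such_column': 'x'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```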
