# Transforming Data 

Now that we have covered the loading and cleaning of data, we will dive into how we can transform our DataFrames to discover powerful insights and relevant information. We can transform our data based on what type of insights we want to discover and our objectives. 

To start the lesson, we will import the relevant packages and then use the `read_csv()` function to load our data as a DataFrame. For this lesson, we will be importing a dataset of student grades.
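
As a minimal sketch of the loading step, the cell below reads a tiny CSV into a DataFrame. The in-memory text and the name `df_example` are stand-ins made up for illustration; in the lesson you would pass the path of the grades file to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the grades file (not the lesson's real data)
csv_text = """StudentID,FirstName,GradeAverage
20123401,Ana,A
20123402,Ben,B
"""

# read_csv accepts a file path or any file-like object
df_example = pd.read_csv(io.StringIO(csv_text))
print(df_example)
```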

## Selecting Columns and Rows

To select a specific column, put the column name in quotation marks inside square brackets after the DataFrame. pandas returns this column as a Series.

If we need to show the StudentID together with the student's grade, we can select multiple columns by passing a list of column names.
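
The two selection styles can be sketched as follows. The small `df_example` DataFrame is invented for illustration and only borrows column names from the grades data.

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "FirstName": ["Ana", "Ben", "Cy"],
    "GradeAverage": ["A", "B", "C"],
})

# A single column name returns a Series
grades = df_example["GradeAverage"]

# A list of column names returns a DataFrame
id_and_grade = df_example[["StudentID", "GradeAverage"]]

print(type(grades).__name__, type(id_and_grade).__name__)
```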

One simple way of getting the first few or the last few rows of the dataset is to use the `head()` and `tail()` functions.

By default, `head()` returns the first 5 rows of the DataFrame, whereas `tail()` returns the last 5 rows. Both accept a number if you want a different row count.
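
A quick sketch on an invented eight-row DataFrame (`df_example` is not the lesson's data) shows the default row count and the optional argument:

```python
import pandas as pd

# Toy data with 8 rows, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": list(range(1, 9)),
    "Tuition": list(range(30000, 38000, 1000)),
})

first_rows = df_example.head()   # first 5 rows by default
last_three = df_example.tail(3)  # last 3 rows when a count is given

print(first_rows)
print(last_three)
```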

In [None]:
# First 5 rows


In [None]:
# Last 5 rows


## loc & iloc
A more scalable alternative for selecting columns is to use the `loc` and `iloc` indexers. Although they are often written as `loc()` and `iloc()`, they are used with square brackets rather than parentheses.

`loc` is label-based, which means that we specify rows and columns by their labels (names).

`iloc` is integer-based, which means that we specify rows and columns by their integer position, starting at 0.

`df.loc[row_label, column_label]`

`df.iloc[row_position, column_position]`

Besides selecting individual columns or rows, these indexers let you select a range of columns or rows between, or up to, a given label or position.

With these indexers, the `:` symbol selects everything between the values on its left and right. If a side is left empty, the selection extends to the start or end on that side, so a bare `:` selects all columns or rows.
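
The column selections described above can be sketched as follows, using an invented `df_example` that borrows a few column names from the grades data:

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "FirstName": ["Ana", "Ben", "Cy"],
    "LastName": ["Lee", "Roy", "Kim"],
    "GradeAverage": ["A", "B", "C"],
    "Faculty": ["Arts", "Science", "Arts"],
})

# All rows, specific columns: by label and by position
by_label = df_example.loc[:, ["StudentID", "GradeAverage"]]
by_position = df_example.iloc[:, [0, 3]]

# All rows, a range of columns: the loc slice includes its end label,
# while the iloc slice stops before its end position
label_range = df_example.loc[:, "StudentID":"GradeAverage"]
position_range = df_example.iloc[:, 0:4]

print(label_range)
```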


    

In [None]:
# Selecting all rows, with only specified columns


In [None]:
# Using iloc, selecting all rows, with only specified columns


In [None]:
# Using loc, selecting all rows, with all columns in order from StudentID to GradeAverage


In [None]:
# Using iloc, selecting all rows, with all columns in order from StudentID to GradeAverage


We can also use `loc` and `iloc` to select rows as well as columns. In `df_grades`, the row index is numeric, so the index labels are numbers and we can pass a numeric argument to `loc`.

In [None]:
# Selecting rows based on index name, with all columns present


In [None]:
# Selecting rows based on index position, with all columns present


One key difference is that with `loc`, the value after the `:` is included in the result, whereas with `iloc` it is not.
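
A small sketch of this difference, on an invented `df_example` with the default numeric index:

```python
import pandas as pd

# Toy data with the default index 0..4, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "GradeAverage": ["A", "B", "C", "B", "A"],
})

# A single row by label and by position (the same row here,
# because the labels happen to equal the positions)
row_by_label = df_example.loc[2]
row_by_position = df_example.iloc[2]

# Slices: loc includes the end label, iloc excludes the end position
loc_rows = df_example.loc[1:3]    # labels 1, 2 and 3
iloc_rows = df_example.iloc[1:3]  # positions 1 and 2

print(len(loc_rows), len(iloc_rows))
```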

In [None]:
# Selecting rows in between based on index name, with all columns present


In [None]:
# Selecting rows in between based on index position, with all columns present


## Conditional Selection

Suppose that we want to filter the data based on specific conditions, such as students who have an outstanding tuition amount of at least 40,000, or only students in the arts faculty. With conditional statements, we can filter the data to find specific information and insights.

To write a conditional statement, we write a boolean expression that classifies each row as `True` or `False`. pandas then keeps only the rows where the expression is `True`.
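
The filtering mechanism can be sketched in two steps: build the boolean Series, then pass it to the DataFrame. The data in `df_example` is invented for illustration.

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "Faculty": ["Arts", "Science", "Arts"],
    "Tuition": [30000, 40000, 52000],
})

# Step 1: the comparison yields one True/False value per row
mask = df_example["Tuition"] >= 40000

# Step 2: indexing with the mask keeps only the True rows
high_tuition = df_example[mask]

# The same pattern written in one line, for a single faculty
arts_students = df_example[df_example["Faculty"] == "Arts"]

print(high_tuition)
```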

In [None]:
# pandas evaluates each row as True or False; this boolean result is then passed to the DataFrame as the filter condition


In [None]:
# Tuition amount of at least 40k


Suppose we want information on only one faculty:

In [None]:
# Students that are in the arts faculty


Multiple conditions can be combined as well. Suppose we want to see information on students who skipped at least 3 classes and participated in 5 or more office hours.
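
One way to combine conditions is the `&` operator, with each condition wrapped in parentheses (use `|` for "or"). The sketch below uses invented data in `df_example`.

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2, 3, 4],
    "ClassesSkipped": [0, 3, 5, 4],
    "OfficeHoursParticipated": [6, 5, 2, 7],
})

# Parentheses are required because & binds more tightly than >=
both_conditions = df_example[
    (df_example["ClassesSkipped"] >= 3)
    & (df_example["OfficeHoursParticipated"] >= 5)
]

print(both_conditions)
```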

In [None]:
# Students who skipped at least 3 classes and participated in at least 5 office hours


## Adding/Removing Columns and Rows

To add a new column to our DataFrame, we can assign a list to a new column name. Consider a scenario where we want to update our student data with the students' corresponding cities.

In [None]:
# Keep in mind, the length of the list has to match the length of index of the DataFrame
city = ['Vancouver','Toronto','Calgary','Edmonton','Regina', 'Burnaby', 'Coquitlam', 'London', 'Ottawa', 'Texas',
       'Coquitlam', 'London', 'Ottawa', 'Texas','Edmonton','Regina', 'Burnaby', 'Coquitlam', 'London', 'Ottawa', 
        'Texas', 'Toronto','Calgary','Edmonton','Regina', 'Burnaby', 'Coquitlam', 'London',  'Texas','Edmonton'
       ]

df_grades['City'] = city
df_grades

Another way of adding new columns is the `insert()` function. It allows us to add the column at any position we like, not only at the end. Consider a scenario where we now want to add the students' ages to our data and want to see them before their names.

In [None]:
# Adding a new column in index 1 (2nd column)

df_grades.insert(1, "Age", [21, 23, 24, 21, 25, 18, 22, 25, 28, 34, 
                     22, 23, 25, 21, 25, 26, 24, 23, 29, 31, 
                     28, 24, 24, 23, 22, 20, 23, 25, 28, 33 
                    ])
df_grades

While this is uncommon, we can add new rows to the DataFrame as well.

Let's consider a scenario where there is a new student that needs to be submitted to the system. 

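To add new rows, we create a new DataFrame with the data we want to add and then combine it with the original using the `pd.concat()` function. (Older tutorials use `DataFrame.append()`, which was removed in pandas 2.0.)

In [None]:
new_student = {'StudentID': 20123420, 'Age':21, 'FirstName': 'Scottie', 'LastName': 'Barnes', 'GradeAverage': 'A', 
               'Faculty': 'Science', 'Tuition': 50000,'OfficeHoursParticipated': 0, 'ClassesSkipped': 0, 'City': 'Toronto'}

# Build a one-row DataFrame whose index label (30) follows the existing rows
df2 = pd.DataFrame(data=new_student, index=[30])

# DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement
df_grades = pd.concat([df_grades, df2])
df_grades.tail()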

To remove columns and rows, we can use the `drop()` function. If a student is no longer attending the school, or if we find a category to be irrelevant, we can remove the data.

Consider a scenario where the last person to register dropped out of school and their information is to be deleted.
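
A minimal sketch of dropping a row by its index label, on invented data. Taking the label from `df_example.index[-1]` is one possible way to find the last row; passing the label directly works as well.

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "FirstName": ["Ana", "Ben", "Cy"],
})

# drop() returns a new DataFrame; the original is left unchanged
last_label = df_example.index[-1]
without_last = df_example.drop(index=last_label)

print(without_last)
```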

In [None]:
# Drop rows with index number


Another way of deleting rows is based on a condition. Suppose that all the students in the business faculty dropped out.
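
One way to sketch this is to look up the index labels of the matching rows and pass them to `drop()`. The data in `df_example` is invented for illustration.

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2, 3, 4],
    "Faculty": ["Business", "Arts", "Business", "Science"],
})

# Index labels of the rows that match the condition
business_rows = df_example[df_example["Faculty"] == "Business"].index

# Drop those rows by label
remaining = df_example.drop(index=business_rows)

print(remaining)
```

Filtering with the opposite condition, `df_example[df_example["Faculty"] != "Business"]`, gives the same rows here.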

In [None]:
# Drop rows based on column value


We can also drop entire columns. Suppose that the students' city is no longer relevant information to keep in the system.
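
Dropping a column can be sketched with the `columns` argument of `drop()`, again on invented data:

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2],
    "FirstName": ["Ana", "Ben"],
    "City": ["Toronto", "Calgary"],
})

# A list lets us drop several columns at once
without_city = df_example.drop(columns=["City"])

print(without_city.columns.tolist())
```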

In [None]:
# Drop columns


## Creating New Indices

We can modify the index of our DataFrame so that it is more relevant to our needs than the standard numbering system. One example would be to use our StudentIDs as the index. We can use the `set_index()` function to do this.

If we want to revert to the old index, we can use the `reset_index()` function.
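
A short sketch of the round trip, using an invented `df_example`:

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [20123401, 20123402],
    "FirstName": ["Ana", "Ben"],
})

# StudentID moves out of the columns and becomes the row index
indexed = df_example.set_index("StudentID")

# Rows can now be looked up by StudentID with loc
ben = indexed.loc[20123402, "FirstName"]

# reset_index() moves StudentID back to a column and restores 0, 1, ...
restored = indexed.reset_index()

print(indexed)
```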

## Grouping Data

Another powerful way to derive insights from your data is the `groupby()` function. It collects rows that share the same value into groups, so that an aggregated value can be computed for each group.

Some common functions used after `groupby()` include:

    1. mean
    2. median
    3. count
    4. sum
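
The pattern is the same for each of these aggregations: group by a column, pick the column to aggregate, then call the function. A sketch with `count` and `mean` on invented data:

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "Faculty": ["Arts", "Science", "Arts", "Science", "Arts"],
    "Age": [20, 22, 24, 26, 22],
})

# Number of students in each faculty
students_per_faculty = df_example.groupby("Faculty")["StudentID"].count()

# Average age in each faculty
mean_age = df_example.groupby("Faculty")["Age"].mean()

print(students_per_faculty)
print(mean_age)
```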

For example, suppose we want to group the students by faculty and then count the number of students in each faculty.

In [None]:
# Selecting the count of StudentIDs, grouped by each faculty


If we wanted to find the average age by faculty:

In [None]:
# Selecting the mean of the ages, grouped by each faculty


Suppose we wanted to find the tuition spent for each faculty, broken down by grade. 
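
Grouping by more than one column can be sketched by passing a list of column names. The data below is invented, and using `mean()` is an assumption; `sum()` would give the total tuition per group instead.

```python
import pandas as pd

# Toy data, assumed for illustration only
df_example = pd.DataFrame({
    "Faculty": ["Arts", "Arts", "Arts", "Science", "Science"],
    "GradeAverage": ["A", "A", "B", "B", "B"],
    "Tuition": [30000, 34000, 40000, 50000, 42000],
})

# One result row per (Faculty, GradeAverage) combination present in the data
tuition_by_group = df_example.groupby(["Faculty", "GradeAverage"])["Tuition"].mean()

print(tuition_by_group)
```

Combinations that never occur in the data, such as Science with an A average in this toy example, do not appear in the result.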

In [None]:
# Selecting the tuition for each grade average within each faculty


Based on the example above, we can uncover insights that were not obvious to us previously: the engineering faculty has no students with an A average, no students have an F average, and engineering students with a B or C average pay the most out of all faculties.

## Concatenation of DataFrames

In a corporate environment, there will be many instances where multiple datasets need to be combined. In these cases, the `concat()` function allows us to combine DataFrames into one.

Suppose we have a separate DataFrame of student information on another server:

In [None]:
data2 = {'StudentID': [20123420,20123421], 'Age':[33,31], 'FirstName': ['Stephen','Klay'], 
         'LastName': ['Curry','Thompson'], 'GradeAverage': ['A','A'], 'Faculty': ['Science','Math'], 
         'Tuition': [31000,41000], 'OfficeHoursParticipated': [3,1], 'ClassesSkipped': [4,6], 
         'State': ['California','California']}
df_grades2 = pd.DataFrame(data2)
df_grades2

There are two ways to concatenate the two DataFrames:

1. Concatenating the DataFrames horizontally
2. Concatenating the DataFrames vertically

We will first go through concatenating vertically. The two rows in the second DataFrame are added below the first. However, the result also gains a `State` column, because the second DataFrame has this column. The rows from the first DataFrame, which do not have this information, show `NaN` in that column.
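
A sketch of this behaviour on two invented DataFrames (`df_a` and `df_b` are hypothetical names). Passing `ignore_index=True` is an optional choice made here so that the combined rows are renumbered.

```python
import pandas as pd

# Toy data, assumed for illustration only
df_a = pd.DataFrame({"StudentID": [1, 2], "City": ["Toronto", "Calgary"]})
df_b = pd.DataFrame({"StudentID": [3], "State": ["California"]})

# Vertical concatenation is the default (axis=0): rows are stacked
stacked = pd.concat([df_a, df_b], ignore_index=True)

# Columns missing from one input are filled with NaN for its rows
print(stacked)
```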

In [None]:
# Concatenate two dataframes vertically, adding additional rows


If we were to concatenate the two DataFrames horizontally, the result would be awkward: both datasets have the same columns, so those columns would appear twice, side by side. When combining two DataFrames, we must consider whether we want to add more data at the column level or at the row level before deciding to concatenate horizontally or vertically.
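
The duplicated columns can be seen in a small sketch with `axis=1`, again using invented DataFrames with hypothetical names:

```python
import pandas as pd

# Toy data, assumed for illustration only
df_a = pd.DataFrame({"StudentID": [1, 2], "Age": [21, 23]})
df_b = pd.DataFrame({"StudentID": [3, 4], "Age": [33, 31]})

# axis=1 places the DataFrames side by side, matching rows by index label
side_by_side = pd.concat([df_a, df_b], axis=1)

print(side_by_side.columns.tolist())
```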

In [None]:
# Concatenate two dataframes horizontally, adding additional columns
