# **Data input, manipulation and output** (SOLUTIONS)

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this exercise session, we will load data from a file, manipulate the data in Python and finally write the result back to a file. These basic operations are important in any data science project. 

[Pandas](https://pandas.pydata.org/) is a Python package built for data analysis. It includes many useful functions, including one to read comma-separated (.csv) files. We will use this package as it's both easy to use and very powerful.

Firstly, we need to import Pandas. As you will be learning (or already know), popular Python packages have common abbreviations. These abbreviations are used as a short-hand for that package throughout a Python script. Some of these common abbreviations include:

- `import pandas as pd`
- `import numpy as np`
- `import tensorflow as tf`
- etc.

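Secondly, we can call any function in the Pandas package as `pd.someFunction`. Note that `pd` is not a reserved Python keyword; it is simply the conventional alias chosen in the `import` statement. Avoid assigning the name `pd` to any other variable, as doing so would hide the package for the rest of the script. 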

Thirdly, `import` statements are generally located at the top of a script, even though they might only be used at a later stage.
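
As a minimal sketch of how the alias behaves, the snippet below shows that `pd` is an ordinary name bound to the Pandas module. The course row is made up for illustration and is not taken from `epfl.csv`.

```python
import pandas as pd

# `pd` is an ordinary name bound to the pandas module, not a reserved word
alias_kind = type(pd).__name__

# Made-up example row, used only to show a call through the alias
courses = pd.DataFrame({"course": ["Analysis I"], "code": ["MATH-101"]})

print(alias_kind)      # module
print(courses.shape)   # (1, 2)
```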

Finally, do not hesitate to search the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) for help.


--------

## Part 0: Setup

In [None]:
# Import pandas

import pandas as pd

# **MAIN EXERCISE**

## Part 1: Loading .csv data

In this part, we load input data from the `epfl.csv` file into this Jupyter notebook. This file contains courses offered at EPFL with the following two columns: 

- course: full course name
- code: course code, including field of study
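
To experiment without the data file, a small stand-in can be read from an in-memory string. This is only a sketch: the three rows below are invented and merely mimic the two-column layout described above.

```python
import io

import pandas as pd

# Invented stand-in for data/epfl.csv with the same two columns
csv_text = (
    "course,code\n"
    "Analysis I,MATH-101\n"
    "Introduction to programming,CS-101\n"
    "Electronics,EE-200\n"
)

# pd.read_csv accepts a file path or any file-like object
sample_courses = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample_courses.shape
print(n_rows, n_cols)  # 3 2
```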


**Q 1**: Load .csv data. What shape does it have? 

In [None]:
# Load data 

EPFLcourses = pd.read_csv('data/epfl.csv')

In [None]:
# Look at the number of rows and columns (the shape)

EPFLcourses.shape

**Q 2**: Look at the data shape, first 5 and last 5 rows. Can you draw a random sample from the data?

In [None]:
# Look at the first 5 rows

EPFLcourses.head(n=5)

In [None]:
# Look at the last 5 rows

EPFLcourses.tail(n=5)

In [None]:
# Look at a random sample of 5 rows

EPFLcourses.sample(n=5)

## Part 2: Extract the course field

In this part, we extract the field code (e.g. CS, EE, MATH) from the course code column. We then create a new column called `field`.

**Q 1**: Create a new column and fill it with 0s.

In [None]:
# Create a dummy column called 'field' filled with 0s

EPFLcourses['field'] = 0
EPFLcourses.head(n=5)

**Q 2**: Fill the new column with the field of the course. What symbol do you split on?

In [None]:
# Fill the 'field' column with the split code

EPFLcourses['field'] = EPFLcourses['code'].str.split('-')
EPFLcourses.head(n=5)

In [None]:
# Select the first list element (index 0) for each field entry

EPFLcourses['field'] = EPFLcourses['field'].str[0]
EPFLcourses.head(n=5)
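
The two cells above can also be chained into a single expression. The sketch below uses a few made-up course codes (assumed to follow the `FIELD-NUMBER` pattern) and shows that a missing code stays missing after the split.

```python
import pandas as pd

# Made-up codes; None stands for a course without a code
codes = pd.Series(["CS-101", "MATH-203", None, "EE-310"])

# Split on '-' and keep the first piece in one chained expression
fields = codes.str.split("-").str[0]

# Missing codes propagate as missing fields instead of raising an error
print(fields.isna().tolist())    # [False, False, True, False]
print(fields.dropna().tolist())  # ['CS', 'MATH', 'EE']
```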

## Part 3: Clean data

In this part, we clean the dataset. We find missing fields and remove those entries.

**Q 1**: Look at some examples where a value for `field` is missing. What's a course without a code/field?

In [None]:
# Look at the entries without a field

EPFLcourses[EPFLcourses['field'].isna()].head(n=5)

**Q 2**: How many rows are removed by dropping all rows without a value for `field`?

In [None]:
# Inspect the shape of data before removing missing values

EPFLcourses.shape

In [None]:
# Remove entries without a field

EPFLcourses = EPFLcourses.dropna(subset=['field'])

In [None]:
# After removing entries, inspect the shape of data again

EPFLcourses.shape


81 rows have been dropped.
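
Rather than comparing the two shapes by eye, the number of dropped rows can be computed directly. The following is a sketch on a tiny invented dataframe, not on the EPFL data.

```python
import pandas as pd

# Invented data: one course has no field
df = pd.DataFrame({
    "course": ["Analysis I", "Unlisted seminar", "Electronics"],
    "field": ["MATH", None, "EE"],
})

# Count missing fields before dropping them
n_missing = int(df["field"].isna().sum())

# Drop rows where 'field' is missing and compare row counts
cleaned = df.dropna(subset=["field"])
n_dropped = len(df) - len(cleaned)

print(n_missing, n_dropped)  # 1 1
```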

In [None]:
# Make sure that entries were indeed removed

EPFLcourses[EPFLcourses['field'].isna()].head(n=5)


# **ADVANCED EXERCISE**

*Optional.* If time permits and you feel comfortable with Python, continue with the advanced parts of this exercise below.

## Part 4: Count the number of courses in each field

Let's find out how many courses EPFL offers in each field. For that, we can use Pandas' `.groupby()` method. 

**Q 1**: Group the data by `field`. How many Architecture (AR) courses are offered at EPFL?

Hint: use the `.agg('count')` function in Pandas. Be careful, though: is the function robust to missing values? 

In [None]:
# Group by field and count the number of course entries

EPFLcourses_count = EPFLcourses[['course', 'field']].groupby('field').agg('count')
EPFLcourses_count.head()
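
Regarding the hint about missing values: `count` only counts non-missing entries in each group, whereas `.size()` counts all rows. The sketch below illustrates the difference on invented data in which one CS row has no course name.

```python
import pandas as pd

# Invented data: the second CS row has a missing course name
df = pd.DataFrame({
    "course": ["Algorithms", None, "Circuits"],
    "field": ["CS", "CS", "EE"],
})

# count() skips missing values in the counted column
counted = df.groupby("field")["course"].count()

# size() counts every row in the group, missing or not
sized = df.groupby("field").size()

print(counted["CS"], sized["CS"])  # 1 2
```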


**Q 2**: Sort the data in descending order by course count. What field has the most courses?

In [None]:
# Sort the data in descending order to find out what the most common fields are

EPFLcourses_count = EPFLcourses_count.sort_values(by='course', ascending=False)
EPFLcourses_count.head()
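
As an alternative sketch (not the approach used in this solution), `value_counts()` on the `field` column counts and sorts in one step. Note that it counts non-missing `field` values rather than non-missing `course` values, so the two approaches only agree when no course names are missing. The fields below are invented.

```python
import pandas as pd

# Invented field labels
fields = pd.Series(["CS", "EE", "CS", "MATH", "CS", "EE"], name="field")

# value_counts() returns counts sorted in descending order by default
by_field = fields.value_counts()

top_field = by_field.index[0]
top_count = int(by_field.iloc[0])
print(top_field, top_count)  # CS 3
```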


## Part 5: Write data to disk

In this part, we write the number of courses per field to disk as an Excel (.xlsx) file and as a comma-separated (.csv) file.

**Q 1**: Re-organize the Pandas dataframe such that it has two columns: field, course_count.

In [None]:
# Re-organize the Pandas dataframe such that it has two columns: field, course_count

EPFLcourses_count = EPFLcourses_count.reset_index()
EPFLcourses_count.columns = ['field', 'course_count']
EPFLcourses_count.head()


**Q 2**: Write the data to an .xlsx and .csv file.

In [None]:
# Write data to .xlsx, without the index
# (writing .xlsx requires an Excel engine such as openpyxl to be installed)

EPFLcourses_count.to_excel('data/epfl_output.xlsx', index=False)


In [None]:
# Write data to .csv, without the index

EPFLcourses_count.to_csv('data/epfl_output.csv', index=False)
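
A quick way to check that the output was written as intended is to read it back. The sketch below does this for the .csv case only, using a temporary directory and invented counts so that it runs without the `data/` folder.

```python
import os
import tempfile

import pandas as pd

# Invented counts with the same two columns as the output above
counts = pd.DataFrame({"field": ["CS", "EE"], "course_count": [3, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "epfl_output.csv")

    # index=False keeps the row index out of the file
    counts.to_csv(path, index=False)

    # Read the file back to confirm columns and values survived the round trip
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['field', 'course_count']
```

Without `index=False`, the file would contain an extra unnamed column holding the row index.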
