# Big Data Processes exercise - week 1

## <font color = green>Recap</font>

- There are two types of cells: markdown cells and code cells
- To add a markdown cell, click on "+ Markdown" above
- To add a code cell, click on "+ Code" above
- Writing notes in the notebook can be super helpful. You can write notes in Markdown cells or alternatively, in the code cells. If you want to write a note in a code cell, remember to start the sentence with #. This signals that everything after the '#' is a comment and this line will then not be executed when running the cell. 
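
A tiny sketch of what a comment looks like in a code cell (the variable name `total` is made up for this illustration):

```python
# This whole line is a comment, so Python skips it when the cell runs
total = 2 + 3  # anything after the '#' on a code line is ignored as well
print(total)
```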

Good shortcuts to keep in mind:
<ul>
    <li><i> Shift+Enter </i>: Execute a cell and advance to the next one</li>
    <li><i> Ctrl+Enter (Windows) or Cmd+Enter (Mac) </i>: Execute a cell and remain on it</li>

</ul>

#### **<font color = Orange>1. Importing an external library - Pandas</font>**

The first thing we need to do is to import a so-called 'library'. 
In Python, a library refers to a collection of pre-written code or modules that we can use to perform specific tasks, like manipulating a dataset. 
It adds extra functionalities and tools to the Python version we just downloaded.
If you are curious about libraries for Python, you can check this link out: https://www.edureka.co/blog/python-libraries/. It might not make sense now, but it can be useful later on. 

The library we want to import in this notebook is called Pandas. It contains functions that are useful when working with datasets and data analysis.
Data scientists often use 'pd' as an alias for Pandas, which allows shorter references later on. We will use this alias as well (see the code cell below).

To learn more about the Pandas library, follow this link: https://pandas.pydata.org/

For a beginner-friendly tutorial on Pandas, click here: https://www.w3schools.com/python/pandas/default.asp

Okay, now try running the code cell below to import the Pandas library. Do it now, before reading any further...

In [1]:
import pandas as pd

You probably got a **<font color = red>"ModuleNotFoundError"</font>**

This is because you have not installed the library yet! Before we show you how to install a new library so that you can import it, we'll just go through the difference between installing and importing a library:

- **Installing:** This is a one-time setup process where you download and install a library and its dependencies on your system. It's typically done using a package manager like pip.
- **Importing:** This is the process of making a library available for use in your Python code. Once installed, you import the library using the import statement to access its features.

In practice, the typical workflow involves installing a library once and importing it whenever you need to use it in your code. The installation step ensures that the library is present on your system, while the import step allows you to access and utilize the library's functionalities in your Python scripts or programs.

REMEMBER, whenever you want to use a library that you have not used before, you will need to install it first.
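
If you are unsure whether a library is already installed, one possible way to check from a code cell is sketched below. The helper name `is_installed` is our own invention, not something built into Python or Pandas; it relies on `importlib.util.find_spec` from Python's standard library.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find the module in the current environment."""
    # find_spec returns None when no installed module has this name
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))    # json ships with Python itself
print(is_installed("pandas"))  # depends on whether you have run 'pip install pandas'
```

If the second line prints False, the library still needs to be installed before `import pandas as pd` can work.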

Okay, let's install Pandas, so you can import it. To do this:
- Open a terminal (go to View > Terminal in the dropdown menu)
- Type: pip install pandas
- Hit enter
- Wait for the library to be downloaded

After this, try running the code cell above again - now, it should work. You have now (hopefully) successfully installed and imported the library Pandas!
If you continue to have issues, let the TAs know. The cause can be that the notebook is not connected to the Python environment where the library was installed.
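
If the import still fails after installing, it can help to check which Python the notebook is actually running. The sketch below only uses Python's built-in `sys` module; comparing its output with the Python used in your terminal is a suggestion from us, not an official troubleshooting procedure.

```python
import sys

# Path of the Python program that runs this notebook
print(sys.executable)

# Version of that Python (major and minor number)
print(sys.version_info.major, sys.version_info.minor)
```

If the path printed here differs from the Python your terminal uses, the library was probably installed somewhere the notebook does not look.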

#### **<font color = Orange>2. Importing the dataset</font>**

Let's try to use Pandas for importing a dataset!
The first dataset we will open is the one called 'Data.txt' which is placed in the same folder as this notebook.
Before moving to the code below, open the text file 'Data.txt' by double clicking on it. 
As you can see, it looks a bit like a table with the columns: Country, Age, Salary, Purchased...

Okay, now let's try to "transfer" the data in the txt-file into a variable in our notebook using Pandas. You do this by running the code cell below.

In [2]:
dataset_csv = pd.read_csv('Data.txt', delimiter = '\t')

Explanation of the code above:
- "dataset_csv" is the variable name we have chosen for our dataset (in Pandas-lingo, what is stored in our variable is called a "dataframe"). As "dataset_csv" is a variable we create, we decide what the name should be. Some find it helpful to name their datasets after the content of the file, e.g., "health"; others just name their dataframes df, and that is perfectly okay too. Just remember that you will be writing the name of the dataframe often, so a name like the_health_data_of_several_different_countries is not very practical. As you progress, you might work with several different dataframes in your Jupyter Notebook, so keeping a list of what the different dataframes contain might be helpful.
- "pd" is the alias we used for the library Pandas. In other words, we are telling Python that we want to use Pandas to read the file (= "pd.read_csv")
- "Data.txt" is the name of the file we want to open. To be able to open it like this, the Jupyter Notebook and the file need to be saved in the same folder
- "delimiter" is a so-called 'parameter' which we use to tell Pandas how the values are separated in the file. As you might remember from examining the txt-file, the "columns" in Data.txt are separated by tabs, and therefore we choose the delimiter '\t' (which stands for tab)
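
To see what the delimiter parameter actually changes, here is a small sketch. It does not read Data.txt; instead it uses a short, made-up piece of tab-separated text and `io.StringIO` (from Python's standard library), so the variable names and values below are our own assumptions.

```python
import io
import pandas as pd

# Made-up, tab-separated text that mimics the layout of a file like Data.txt
sample_text = "Country\tAge\tSalary\nFrance\t44\t72000\nSpain\t27\t48000\n"

# io.StringIO lets read_csv treat the string as if it were a file
right = pd.read_csv(io.StringIO(sample_text), delimiter="\t")
wrong = pd.read_csv(io.StringIO(sample_text), delimiter=";")

print(right.shape)  # rows and columns when the delimiter matches the file
print(wrong.shape)  # with the wrong delimiter, everything lands in one column
```

With the matching delimiter you get three separate columns; with the wrong one, each line ends up as a single value.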

If you get a <font color = red><b>"FileNotFoundError"</b></font>, it is because the notebook cannot find the file you are trying to load. The issue is often that the file is not in the same folder as the Jupyter Notebook file. There are two solutions to this: 

1) Move the txt-file into the correct folder (= the same folder as the notebook)
OR 
2) Add the whole path of the file. You can usually get it by right-clicking the file itself and choosing "copy path". When adding it to your notebook, you need to put an "r" before the file path. This tells Python to treat the text as a raw string, so the backslashes in the path are kept as they are instead of being read as special characters (such as '\t' for tab). The line will now look like this:  

<i>dataset_tsv = pd.read_csv(r'C:\Users\emili\Downloads\Data.txt', delimiter = '\t')</i>
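
A short sketch of why the "r" matters. The path below is hypothetical and chosen only because it contains '\t' and '\n', two combinations that Python normally reads as special characters:

```python
# Hypothetical Windows-style path, used only for illustration
plain = 'C:\temp\new_data.txt'   # '\t' becomes a tab and '\n' a line break
raw = r'C:\temp\new_data.txt'    # the r prefix keeps every backslash as typed

print(len(plain), len(raw))  # the plain version is shorter: two backslashes were swallowed
print(raw)
```

Only the raw version still looks like the path you copied, which is why it is the safer choice for Windows paths.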

Right now, we are working with a txt-file containing some data. However, there are many other types of files that can contain data. If we want to open another type of file, we need to use another function than pd.read_csv. For instance, data can also be stored in an Excel file. In that case, we have to use the function below:

dataset_excel = pd.read_excel('Data.xlsx', index_col = 0, header = 0)

You can find more information about how to read different file types below - or by asking Google or ChatGPT ;)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
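
As a small illustration of another reader function, here is a sketch using `pd.read_json`. The JSON text is made up and kept in a string (wrapped in `io.StringIO`) so the example runs without any extra file; with a real file you would pass the filename instead.

```python
import io
import pandas as pd

# Made-up JSON text: a list of records, one per row
json_text = '[{"Country": "France", "Age": 44}, {"Country": "Spain", "Age": 27}]'

# Each record becomes a row and each key becomes a column
dataset_json = pd.read_json(io.StringIO(json_text))
print(dataset_json)
```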

#### **<font color = Orange>3. Opening the dataset and examining it</font>**

Okay - so we just "transferred" the data from the Data.txt-file into a variable called "dataset_csv".
To take a look at what our new variable contains, we simply write its name in a code cell and run it. Try it out below:

In [3]:
dataset_csv

#Here we are checking what is stored in the variable and, as expected, it is the data from the txt-file

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
...,...,...,...,...
65,Pakistan,37.0,25000.0,Yes
66,Israel,70.0,80000.0,No
67,Tunisia,42.0,36000.0,No
68,Algeria,20.0,18000.0,No


#### **<font color = Orange>4. Redefining a variable name</font>**
If we want to shorten the name of the variable, we just assign it to a new variable:

In [4]:
ds = dataset_csv
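
One thing worth knowing: assigning like this does not create a second copy of the data, it only gives the same dataframe a second name. The sketch below uses a tiny made-up dataframe (not Data.txt) to show the difference between a new name and a real copy made with `.copy()`; the name `ds_copy` is our own.

```python
import pandas as pd

# Tiny made-up dataframe, only for illustration
dataset_csv = pd.DataFrame({"Country": ["France", "Spain"], "Age": [44, 27]})

ds = dataset_csv              # a second name for the very same dataframe
ds_copy = dataset_csv.copy()  # an independent copy with the same content

print(ds is dataset_csv)       # are the two names the same object?
print(ds_copy is dataset_csv)  # is the copy the same object?
```

For just looking at the data, as we do in this notebook, the short name is all you need.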

#### **<font color = Orange>5. Using a method</font>**

Below, we present a few methods you can use to display the dataset in different ways.

Using the .head() method on our new variable will give us the first five rows of our dataframe.


In [8]:
ds.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


If you want to display a specific number of rows from the top, you use the .head() method with a number in the brackets:


In [9]:
ds.head(10)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


Using the .tail() method will give you the last five rows:

In [15]:
ds.tail()

Unnamed: 0,Country,Age,Salary,Purchased
65,Pakistan,37.0,25000.0,Yes
66,Israel,70.0,80000.0,No
67,Tunisia,42.0,36000.0,No
68,Algeria,20.0,18000.0,No
69,Morocco,60.0,56000.0,Yes


Just like before, you can insert a number into the brackets and it will give you that number of rows:

In [10]:
ds.tail(3)

Unnamed: 0,Country,Age,Salary,Purchased
67,Tunisia,42.0,36000.0,No
68,Algeria,20.0,18000.0,No
69,Morocco,60.0,56000.0,Yes
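
If you want to experiment with .head() and .tail() without the dataset, here is a small sketch on a made-up dataframe holding the numbers 0 to 9 (the name `numbers` is just for this example):

```python
import pandas as pd

# Made-up dataframe with ten rows: the values 0, 1, ..., 9
numbers = pd.DataFrame({"value": range(10)})

first_default = numbers.head()  # no number given: five rows from the top
first_three = numbers.head(3)   # three rows from the top
last_two = numbers.tail(2)      # two rows from the bottom

print(len(first_default))
print(first_three["value"].tolist())
print(last_two["value"].tolist())
```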


### You're now ready to try it out yourself! :)

#### **<font color = Green>1. Using markdowns</font>**

Can you turn this question into a green header? 

#### **<font color = Purple>2. Using methods</font>**

Use what you learned so far to find the name of the country in the 48th row (remember that the index of the dataframe starts at 0, not 1, so the 48th row has index 47):

In [15]:
ds.head(48)

#The last row displayed is the 48th row, which has index 47

Now, it's time for a challenge!

Search online to figure out how to get Pandas to return the information in the row with index 47 *only* (i.e., not several rows of the dataframe):

In [None]:
ds.iloc[47]

Country       Poland
Age             54.0
Salary       36000.0
Purchased         No
Name: 47, dtype: object
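
A brief sketch of how `.iloc` picks rows by position, shown on a small made-up dataframe (the names `people`, `row` and `one_row_table` are ours). Positions start at 0, so position 1 is the second row.

```python
import pandas as pd

# Made-up dataframe, only for illustration
people = pd.DataFrame({"Country": ["France", "Spain", "Germany"], "Age": [44, 27, 30]})

# A single position returns that one row as a Series
row = people.iloc[1]
print(row["Country"])

# A list with one position returns a one-row dataframe instead
one_row_table = people.iloc[[1]]
print(one_row_table)
```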

#### **<font color = Purple>3. Creating a new variable</font>**

Can you create a variable named <i> "last_5_rows" </i> and assign the last 5 rows of the dataset to it?

In [17]:
last_5_rows = ds.tail(5)

#### **<font color = Purple>4. Create and open your own file</font>**

Create a .txt file and give it a name. Within that file paste in the following three lines:

One;Two;Three;Four\
1;2;3;4\
I;II;III;IV

See if you can identify the <b>delimiter</b> and open the file yourself.\
<i>(Make sure to save your .txt-file in the same folder as this Jupyter Notebook!)</i>

If you need more information to get going: Start by loading the dataset into your notebook. You can reuse the code from "2. Importing the dataset" - you just need to change the filename.  

Then check to see if you used the right delimiter by using the ".head()" method again. This will display your data. Does it look right? If not, try changing the delimiter.

In [22]:
small_file_df = pd.read_csv('small_test_file.txt', delimiter = ';')
small_file_df.head()

Unnamed: 0,One,Two,Three,Four
0,1,2,3,4
1,I,II,III,IV
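
If you cannot tell the delimiter by eye, one rough trick is to count how often each candidate character appears in the first line of the file. The sketch below is a heuristic we made up for this exercise, not something Pandas does for you, and the list of candidates is an assumption:

```python
# First line of the small file from the exercise above
first_line = "One;Two;Three;Four"

# Assumed list of common delimiters to try
candidates = [",", ";", "\t", "|"]

# Count how many times each candidate occurs in the line
counts = {candidate: first_line.count(candidate) for candidate in candidates}

# The candidate that occurs most often is the most likely delimiter
likely_delimiter = max(counts, key=counts.get)
print(counts)
print(likely_delimiter)
```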


#### **<font color = Purple>5. Opening an Excel file</font>**

Try creating an Excel file with data in it (or use one you already have) and open it from a different folder than the one your notebook is saved in.



In [52]:
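# Pandas loads its Excel reader on its own, so no extra import is needed.
# For .xlsx files it uses the openpyxl library, which must be installed (pip install openpyxl).
excel_df = pd.read_excel('/Users/simonskodt/MSc_CS/8_semester/BIDAP_Big-Data-Processes/Dummy_excel_files/Orders_with_Null.xlsx')
excel_df.head()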

Unnamed: 0,Order ID,Order Date,Order Quantity,Sales,Ship Mode,Profit,Unit Price,Customer Name,Customer Segment,Product Category
0,3,2010-10-13,6,261.54,Regular Air,-213.25,38.94,Muhammed MacIntyre,Small Business,Office Supplies
1,6,2012-02-20,2,6.93,Regular Air,-4.64,2.08,Ruben Dartt,Corporate,Office Supplies
2,32,2011-07-15,26,2808.08,Regular Air,1054.82,107.53,Liz Pelletier,Corporate,Furniture
3,32,2011-07-15,24,1761.4,Delivery Truck,-1748.56,70.89,Liz Pelletier,Corporate,Furniture
4,32,2011-07-15,23,160.2335,Regular Air,-85.129,7.99,Liz Pelletier,Corporate,Technology


In [50]:
# Simple query: keep only the rows where 'Order ID' equals 3
ORDER_ID = 3
order_3 = excel_df[excel_df['Order ID'] == ORDER_ID]
order_3

Unnamed: 0,Order ID,Order Date,Order Quantity,Sales,Ship Mode,Profit,Unit Price,Customer Name,Customer Segment,Product Category
0,3,2010-10-13,6,261.54,Regular Air,-213.25,38.94,Muhammed MacIntyre,Small Business,Office Supplies


# Take home messages

After finishing this notebook, you should know:
- How to install and import a library
- How to store data from different types of files in a variable/dataframe
- How to apply methods to a variable/dataframe to display different rows

Still a bit confused? Take a look at this beginner's guide to Pandas: https://www.w3schools.com/python/pandas/default.asp