# Data Analytics and Visualization (part 1)

The best way to learn programming is by solving real-world problems. That's why this course is designed around a common and practical scientific task: ***data analysis***. 

Throughout this part of the workshop, we will work with a dataset that contains valuable information about the physical oceanography at Potter Cove, located on King George Island in Antarctica. 

This dataset includes measurements collected using a device known as a CTD, which stands for Conductivity, Temperature, and Depth. The data was gathered between January 4 and February 3, 1994, and includes important details such as: 

- ***water depth [meters]***
- ***temperature [Celsius]***
- ***salinity levels***
- ***water density [kg/m³]***

In total, we have 774 individual measurements from this study, which will help us understand the ocean conditions in this unique area.

## Load Pandas

**Python** doesn’t load all of the libraries available to it by default. We have to add an import statement to our code in order to use library functions.


In [38]:
#Your code goes here

When we call a function from a library, we use the syntax **LibraryName.FunctionName**. For Pandas, that is **pandas.FunctionName**.

Giving *pandas* a *nickname* such as **pd** makes our lives easier, because we can write **pd.FunctionName** instead.
This handy trick saves us from typing the full “pandas” name every time we use a function from the Pandas library.
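
As a minimal sketch of the import with a nickname, the block below loads Pandas as **pd** and calls one of its functions through that alias. The tiny table and its values are made up for illustration only.

```python
import pandas as pd  # "pd" is the conventional nickname (alias) for pandas

# With the alias in place, library functions are called as pd.FunctionName
toy = pd.DataFrame({"Temperature": [0.5, 0.7]})  # made-up values
print(toy)
```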


## Read CSV file using Pandas

Pandas can be used to import data stored in a Comma-Separated Values (CSV) file format. CSV is a common and simple way of structuring tabular data, where each line corresponds to a row and the values within a line are separated by commas.


In [39]:
#Your code goes here

This code returns an overview of what the dataset looks like, showing the first and last five rows.

The **read_csv** function has successfully read our file, but the result has not been assigned to a variable, so we cannot reuse it for further processing and analysis. To keep it, we will store it in a new variable called “df”, short for dataframe:
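
A small sketch of the idea, assuming a short piece of CSV text in place of the real file. The column names and values here are hypothetical and may not match the workshop dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file on disk
csv_text = """Date,Water_Depth,Temperature,Salinity_Level
1994-01-04,1.0,0.9,33.8
1994-01-04,2.0,0.8,33.9
1994-01-11,1.0,1.1,33.7
"""

# Without an assignment, the returned DataFrame is not kept anywhere
pd.read_csv(io.StringIO(csv_text))

# Assigning it to df keeps the DataFrame available for later steps
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a file on disk, the file path goes where `io.StringIO(csv_text)` appears in this sketch.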

In [40]:
#Your code goes here

If the dataset contains many samples, it is a good idea to use the DataFrame’s **head()** method to see the first few of them. Called by itself, head() returns the first 5 rows, but we can also specify how many rows to display by passing a number as an argument: **head(*10*)**
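
The two calls can be sketched on a made-up table of 20 rows (the column name is an assumption):

```python
import pandas as pd

# Made-up depths, one row per metre, standing in for the real dataset
df = pd.DataFrame({"Water_Depth": range(1, 21)})

first_five = df.head()    # default: first 5 rows
first_ten = df.head(10)   # first 10 rows
print(first_ten)
```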

In [41]:
#Your code goes here

We can also check what data type each column of **df** holds using the **dtypes** attribute:
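
A brief sketch with made-up values: whole numbers are stored as an integer type and decimal numbers as a floating-point type. Text columns usually show up as `object` or as a dedicated string type, depending on the Pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "Water_Depth": [1, 2, 3],          # whole numbers (made up)
    "Temperature": [0.9, 0.8, 1.1],    # decimal numbers (made up)
})
print(df.dtypes)  # one data type per column
```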

In [42]:
#Your code goes here

## Explore the DataFrame Object

Let’s explore the DataFrame Object further. We will be using both methods and attributes.

**Methods** are functions that we can apply to the DataFrame to perform specific operations. Calling a method requires parentheses. If we wish to see a summary of a dataframe, we can use the **info()** method:
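
Here is a minimal sketch on a made-up table with one missing value, so that the non-null counts printed by `info()` differ between columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Water_Depth": [1.0, 2.0, None],   # None becomes a missing value (NaN)
    "Temperature": [0.9, 0.8, 1.1],
})
df.info()  # prints row count, column names, non-null counts, and dtypes
```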

In [43]:
#Your code goes here

This summary provides valuable information about the DataFrame’s structure, data types, and the presence of missing values. It’s a quick overview that helps you understand the content and characteristics of the DataFrame.

We can use the **unique()** function to identify the distinct values within a column or an array.
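
As a small sketch, assuming a hypothetical `Date` column with a repeated value:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["1994-01-04", "1994-01-04", "1994-01-11"]})

# Distinct values, in order of first appearance
unique_dates = df["Date"].unique()
print(unique_dates)
```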

In [44]:
#Your code goes here

It returns the unique values in the **Date/Time** column.

**Attributes** are properties of the DataFrame that provide information about its characteristics. They don’t require parentheses. If we wish to see the shape of the dataframe (its number of rows and columns), we can use the **shape** attribute:
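
A quick sketch on a made-up table shows that `shape` is a pair of numbers, rows first and columns second:

```python
import pandas as pd

df = pd.DataFrame({
    "Water_Depth": [1.0, 2.0, 3.0],
    "Temperature": [0.9, 0.8, 1.1],
})

n_rows, n_cols = df.shape  # no parentheses: shape is an attribute
print(n_rows, n_cols)
```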

In [45]:
#Your code goes here

### Exercise 1
What would be the output of the following commands?
-  df.tail()
-  df.columns


In [9]:
#Your exercise code goes here

## Selecting Data Using Labels

To select a single column, use the DataFrame’s name followed by the column label in square brackets **['ColumnLabel']**.

In [46]:
#Your code goes here

We can also use the column name as an *attribute* to access data from that column using **df.Temperature**
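
The two ways of selecting a single column can be compared in a short sketch. The table is made up. Note that attribute access only works when the column label is a valid Python name, so a label such as `Date/Time` needs the square-bracket form.

```python
import pandas as pd

df = pd.DataFrame({
    "Date/Time": ["1994-01-04", "1994-01-11"],
    "Temperature": [0.9, 1.1],
})

by_label = df["Temperature"]    # square brackets work for any label
by_attribute = df.Temperature   # attribute form, same result here
date_time = df["Date/Time"]     # the "/" rules out attribute access

print(by_label)
```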


In [47]:
#Your code goes here

To select multiple columns, pass a list of column labels inside the square brackets, which gives double square brackets: **[['Column1', 'Column2']]**.

In [48]:
#Your code goes here

We can also store the result of a selection in a new variable and work with that variable later.
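
A minimal sketch, using a made-up table and the hypothetical variable name `subset`:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1994-01-04", "1994-01-11"],
    "Water_Depth": [1.0, 2.0],
    "Temperature": [0.9, 1.1],
})

# The inner brackets form a list of labels; the result is a smaller DataFrame
subset = df[["Date", "Temperature"]]
print(subset)
```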

In [49]:
#Your code goes here

### Exercise 2
What happens if you ask for a column that doesn’t exist?
-  df['Time']


In [14]:
#Your exercise code goes here

## Extracting Range-based Subsets (Slicing)
Slicing is a technique used to extract a portion or subset of elements from a sequence, such as a list or string. It allows us to specify a range of indices to retrieve a subset of the data.

-  Getting Specific Elements
-  Getting a Set of Elements
-  Getting First Few Elements
-  Getting Last Few Elements

Let's go through a simple example of a list before moving back to dataframes:
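
One possible version of such an example, with made-up numbers, covers the four cases listed above:

```python
my_list = [10, 20, 30, 40, 50]

specific = my_list[2]      # a specific element (counting starts at 0)
a_set = my_list[1:4]       # positions 1, 2 and 3 (the end position is excluded)
first_few = my_list[:3]    # the first three elements
last_few = my_list[-2:]    # the last two elements

print(specific, a_set, first_few, last_few)
```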


In [50]:
#Your code goes here

### Exercise 3
What would be the output of the following command?
-  my_list[len(my_list)]


In [16]:
# Your exercise code goes here

## Slicing Rows and Columns
Slicing rows and columns simultaneously involves using **.loc** or **.iloc** and specifying the row indices and column labels or indices we want to include.

-  **.loc** is label-based indexing, meaning we specify the row and column labels.
-  **.iloc** is integer-based indexing, meaning we use integer indices for rows and columns.

#### Using .loc 
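
A short sketch of label-based slicing on a made-up table with the default row labels 0 to 4. Unlike list slicing, a **.loc** slice includes its end label.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1994-01-04"] * 5,
    "Water_Depth": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Temperature": [0.9, 0.8, 0.8, 0.7, 0.7],
})

# Rows labeled 0 through 2 (inclusive), and two columns chosen by label
first_rows = df.loc[0:2, ["Date", "Temperature"]]
print(first_rows)
```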

In [51]:
#Your code goes here

Now, if we want to select the **‘Date’, ‘Water_Depth’, and ‘Salinity_Level’** columns for the rows labeled **1, 3, and 4**, we can use the code below:
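
This selection can be sketched on a made-up table as follows. The values are invented, and the real dataset may use different column labels.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1994-01-04"] * 5,
    "Water_Depth": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Temperature": [0.9, 0.8, 0.8, 0.7, 0.7],
    "Salinity_Level": [33.8, 33.9, 33.9, 34.0, 34.0],
})

# A list of row labels and a list of column labels
picked = df.loc[[1, 3, 4], ["Date", "Water_Depth", "Salinity_Level"]]
print(picked)
```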

In [52]:
#Your code goes here

#### Using .iloc
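
A matching sketch for position-based indexing on a made-up table. With **.iloc**, rows and columns are given by integer position, and the end of a slice is excluded, as in list slicing.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1994-01-04"] * 5,
    "Water_Depth": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Temperature": [0.9, 0.8, 0.8, 0.7, 0.7],
})

# Rows at positions 0, 1, 2 and columns at positions 0, 1
block = df.iloc[0:3, 0:2]

# Specific positions can also be listed explicitly
picked = df.iloc[[1, 3, 4], [0, 2]]
print(block)
print(picked)
```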

In [53]:
#Your code goes here

In both cases, using *.loc* or *.iloc*, the first argument specifies the rows to include, and the second argument specifies the columns to include. 

## Subsetting Data using Criteria
Subsetting data using criteria involves selecting a subset of rows from a DataFrame based on specific conditions. This is often done to keep only the rows that meet certain criteria, so we can focus on the data points that are relevant to our analysis.

We can use conditional statements to filter rows based on specific criteria. The condition is typically applied to a column, and rows meeting the condition are retained.

For example, let’s say we want to subset the DataFrame to only include samples with water depth greater than 25 meters:
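
A minimal sketch with made-up depths: the comparison produces a column of True/False values, and placing it inside the square brackets keeps only the True rows.

```python
import pandas as pd

df = pd.DataFrame({"Water_Depth": [5.0, 20.0, 30.0, 40.0]})

is_deep = df["Water_Depth"] > 25   # True/False for each row
deep = df[is_deep]                 # keep only the rows marked True
print(deep)
```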


In [54]:
#Your code goes here

Also, we can combine multiple criteria using logical operators such as **&** *(AND)* and **|** *(OR)* to create more complex conditions.

For instance, to subset the DataFrame for samples with water depth greater than 25 meters and salinity levels greater than 33:
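
As a sketch on made-up values, note that each condition sits in its own parentheses. This is needed because **&** and **|** bind more tightly than comparisons such as **>**.

```python
import pandas as pd

df = pd.DataFrame({
    "Water_Depth": [5.0, 30.0, 40.0],
    "Salinity_Level": [34.0, 32.5, 34.1],
})

# Both conditions must hold for a row to be kept
deep_and_salty = df[(df["Water_Depth"] > 25) & (df["Salinity_Level"] > 33)]
print(deep_and_salty)
```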

In [55]:
#Your code goes here

We can also use the **~** symbol to negate a condition. For example, to subset the DataFrame for samples with a water depth less than or equal to 25 meters:
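
In a small sketch with made-up depths, negating "greater than 25" keeps the rows at 25 meters or less:

```python
import pandas as pd

df = pd.DataFrame({"Water_Depth": [5.0, 25.0, 30.0]})

# ~ flips True to False and False to True
shallow = df[~(df["Water_Depth"] > 25)]
print(shallow)
```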

In [56]:
#Your code goes here

The **isin()** function is used to filter data based on whether values are present in a specified list or iterable. It’s a convenient way to subset data when we want to select rows that match specific values for a particular column.

Let’s say we want to select rows where the **‘Date’** column has values **‘1994-01-04’** or **‘1994-01-11’**:
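
A minimal sketch, assuming the dates are stored as text in a hypothetical `Date` column:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1994-01-04", "1994-01-07", "1994-01-11"],
    "Temperature": [0.9, 1.0, 1.1],
})

wanted_dates = ["1994-01-04", "1994-01-11"]
selected = df[df["Date"].isin(wanted_dates)]  # keep rows whose Date is in the list
print(selected)
```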


In [57]:
#Your code goes here

**isnull()** and **notnull()** functions are used to detect missing (NaN) values in a DataFrame. **isnull()** returns a DataFrame of the same shape as the input, with True values indicating missing values. **notnull()** returns the opposite.
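
Both functions can be sketched on a made-up table with one missing depth. Applied to a single column, they return one True/False value per row, which we can then use as a filter.

```python
import pandas as pd

df = pd.DataFrame({
    "Water_Depth": [1.0, None, 3.0],   # None becomes a missing value (NaN)
    "Temperature": [0.9, 0.8, 1.1],
})

missing = df[df["Water_Depth"].isnull()]    # rows where the depth is missing
present = df[df["Water_Depth"].notnull()]   # rows where the depth is available
print(missing)
print(present)
```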

Let’s say we want to select rows where the ‘Water_Depth’ column *has missing values*:


In [58]:
#Your code goes here

Let’s say we want to select rows where the ‘Water_Depth’ column *does not have missing values*:

In [59]:
#Your code goes here

## Calculating Statistics from Pandas DataFrame
We can use Pandas DataFrame’s built-in methods to quickly generate summary statistics for our data. For example, the **describe()** method returns summary statistics for numerical columns, such as the count, mean, standard deviation, minimum, and maximum.
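
A brief sketch on made-up values. The result is itself a DataFrame, so individual statistics can be picked out with **.loc**.

```python
import pandas as pd

df = pd.DataFrame({
    "Water_Depth": [1.0, 2.0, 3.0],
    "Temperature": [0.5, 1.0, 1.5],
})

summary = df.describe()   # one column of statistics per numerical column
print(summary)

mean_temperature = summary.loc["mean", "Temperature"]
```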

In [60]:
#Your code goes here

If we want to calculate the standard deviation of a numerical column, we can use the **std()** method.
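
As a sketch with made-up temperatures: by default, Pandas computes the sample standard deviation, which divides by the number of values minus one.

```python
import pandas as pd

df = pd.DataFrame({"Temperature": [0.5, 1.0, 1.5]})

# Sample standard deviation (the Pandas default, ddof=1)
temperature_std = df["Temperature"].std()
print(temperature_std)
```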

In [61]:
#Your code goes here

There are many more statistical functions that you can use. I encourage you to check out the following resources:
-  https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm
-  https://www.scaler.com/topics/pandas/statistical-functions-in-pandas/

I promise you will have fun!

## Groups in Pandas

Frequently, we need to compute summary statistics for subsets of our dataset or for specific attributes within it. As a starting point, we can find the summary statistics of the water density of all the samples using the following code:

In [62]:
#Your code goes here

Again, we might also want to get only specific information, like the maximum:

In [63]:
#Your code goes here

or we can get the average water density of all the samples:
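
These single-column statistics can be sketched as follows. The column label `Water_Density` and the values are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Water_Density": [1026.5, 1027.0, 1027.5]})

density_summary = df["Water_Density"].describe()   # all summary statistics
density_max = df["Water_Density"].max()            # only the maximum
density_mean = df["Water_Density"].mean()          # only the average
print(density_summary)
print(density_max, density_mean)
```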

In [64]:
#Your code goes here

However, when we want to summarize data by one or more variables, such as Date, the Pandas library offers the **.groupby** method. Once a DataFrame is grouped this way, we can compute summary statistics for each group.
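
A minimal sketch of grouping, assuming hypothetical `Date` and `Water_Density` columns with made-up values. Each distinct date forms one group, and the mean is computed within each group.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1994-01-04", "1994-01-04", "1994-01-11", "1994-01-11"],
    "Water_Density": [1026.0, 1027.0, 1027.0, 1028.0],
})

# One mean value per date
mean_by_date = df.groupby("Date")["Water_Density"].mean()
print(mean_by_date)
```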

In [65]:
#Your code goes here

## Basic Math with Pandas
If desired, it’s entirely possible to perform mathematical operations, such as addition or division, on an entire column of our dataframe. 

 
Let's multiply the Temperature column by 2:
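
A short sketch with made-up temperatures: the operation is applied to every value in the column at once. The original column stays unchanged unless we assign the result back.

```python
import pandas as pd

df = pd.DataFrame({"Temperature": [0.5, 1.0, 1.5]})

doubled = df["Temperature"] * 2   # every value multiplied by 2
print(doubled)
```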

In [66]:
#Your code goes here

## Concatenating DataFrames
Concatenating DataFrames refers to combining two or more DataFrames along a particular axis (either rows or columns) to create a single larger DataFrame. This is useful when we have data split across multiple DataFrames and we want to consolidate them into one for analysis or processing.

In Pandas, we can use the **concat()** function to concatenate DataFrames. This function provides various options to control how the concatenation should be performed. 

Let’s say we have two DataFrames, **df1** and **df2**, and we want to concatenate them vertically (along rows):
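
One way to sketch this, with two small made-up tables that share the same columns:

```python
import pandas as pd

df1 = pd.DataFrame({"Water_Depth": [1.0, 2.0], "Temperature": [0.9, 0.8]})
df2 = pd.DataFrame({"Water_Depth": [3.0, 4.0], "Temperature": [0.8, 0.7]})

# Stack df2 below df1 and renumber the rows from 0
concatenated_df = pd.concat([df1, df2], ignore_index=True)
print(concatenated_df)
```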

In [67]:
#Your code goes here

In this example, **pd.concat()** is used to concatenate df1 and df2 vertically into concatenated_df. The **ignore_index=True** argument ensures that the index is reset after concatenation.

We can also concatenate DataFrames **horizontally** by specifying **axis=1** as an argument to **pd.concat()**. This places the DataFrames side by side, joining them along columns.
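
A small sketch of the horizontal case, using two made-up tables with different columns and the same row labels. The variable names are hypothetical.

```python
import pandas as pd

temperatures = pd.DataFrame({"Temperature": [0.9, 0.8]})
salinities = pd.DataFrame({"Salinity_Level": [33.8, 33.9]})

# Rows are matched by their labels (0 and 1 here), columns are placed side by side
side_by_side = pd.concat([temperatures, salinities], axis=1)
print(side_by_side)
```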

In [68]:
#Your code goes here

### Exercise 4
Consider two DataFrames, df1 and df2, with the following data:

**import pandas as pd**

**data1 = {'A': [1, 2, 3], 'B': [4, 5, 6]}**

**data2 = {'A': [7, 8, 9], 'B': [10, 11, 12]}**

**df1 = pd.DataFrame(data1)**

**df2 = pd.DataFrame(data2)**

What will be the output of the following code?

**result = pd.concat([df1, df2], axis=1)**

**print(result)**

Select the correct answer ***(without running the code)***:

a) The concatenated DataFrame with columns A, B, A, B

b) An error will occur because columns A and B are duplicated

c) The concatenated DataFrame with columns A, B, C, D



In [69]:
#Your code goes here

## Saving Pandas DataFrame

We can save a Pandas DataFrame to various file formats using different methods provided by Pandas. Before we save a dataframe, let’s first create a new directory called “Results” inside the directory that contains your code.

Here are some commonly used methods to save a DataFrame:
-  CSV Format: To save a DataFrame to a CSV file, we can use the to_csv() method:


In [70]:
#Your code goes here

This will save the DataFrame to a CSV file named ‘output.csv’ inside a directory called “Results”, without including the index.

-  Excel Format: To save a DataFrame to an Excel file, we can use the to_excel() method:

In [71]:
#Your code goes here

This will save the DataFrame to an Excel file named ‘output.xlsx’ inside a directory called “Results”, without including the index.


-  Other Formats: Pandas supports various other formats, including JSON, Parquet, HDF5, and more. We can use the appropriate method based on the desired format:

    -   JSON: df.to_json("output.json", orient="records")
    -   Parquet: df.to_parquet("output.parquet")
    -   HDF5: df.to_hdf("output.h5", key="data")

Make sure to replace ‘output’ with your desired file name, and keep the extension that matches the chosen format.
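
The CSV option can be sketched as follows, using a made-up table. As an alternative to creating the “Results” directory by hand, this sketch creates it from Python if it does not exist yet, and then reads the file back as a quick check.

```python
from pathlib import Path
import pandas as pd

df = pd.DataFrame({"Water_Depth": [1.0, 2.0], "Temperature": [0.9, 0.8]})

results_dir = Path("Results")
results_dir.mkdir(exist_ok=True)   # create the folder if it is missing

# index=False leaves the row labels out of the file
df.to_csv(results_dir / "output.csv", index=False)

# Reading the file back is a quick way to confirm what was written
check = pd.read_csv(results_dir / "output.csv")
print(check)
```

The Excel, Parquet, and HDF5 writers rely on extra packages that may need to be installed separately, so the CSV route is often the simplest place to start.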
