<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Exports and imports of India Analysis

## Lab. 2. Data Wrangling

Estimated time needed: **30** minutes

## About dataset and a course

<details><summary>Click here for details about the dataset and the course.</summary>

# Incoming data
<ul>
    <li>Country: This field represents the countries involved in the exports and imports.</li>
    <li>Export: This field represents the total value of goods and services exported by India to other countries during a specific period. The value is measured in million US dollars.</li>
    <li>Import: This field represents the total value of goods and services imported by India from other countries during a specific period. The value is measured in million US dollars.</li>
    <li>Total Trade: This field represents the total value of exports and imports combined. It shows the volume of international trade that India has with other countries.</li>
    <li>Trade Balance: This field represents the difference between the total value of exports and the total value of imports. A positive trade balance indicates that India is exporting more than it is importing, while a negative trade balance indicates the opposite.</li>
    <li>Financial Year(start): This field represents the start date of the financial year during which the exports and imports were recorded.</li>
    <li>Financial Year(end): This field represents the end date of the financial year during which the exports and imports were recorded.</li>
</ul>
    
# Target value
<ul>
    <li>Total Trade</li>
</ul>

# Course objectives
This dataset contains data about the exports and imports of India (1997 to July 2022).<br>
In this course you will learn data analysis with Python. Using Python libraries, you will learn how to:
<ul>
    <li>Find the average volume of India's exports and imports;</li>
    <li>Analyze the structure of export-import operations by partner country;</li>
    <li>Examine the impact of the volume of exports and imports on the overall trade balance;</li>
    <li>Estimate the planned volume and structure of imports and exports from India's partner countries for the next period.</li>
</ul>
</details>

## Objectives

After completing this lab you will be able to:

*   Handle missing values
*   Correct data format
*   Standardize and normalize data


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ul>
    <li><a href="#identify_handle_missing_values">Identify and handle missing values</a>
        <ul>
            <li><a href="#identify_missing_values">Identify missing values</a></li>
            <li><a href="#deal_missing_values">Deal with missing values</a></li>
            <li><a href="#correct_data_format">Correct data format</a></li>
        </ul>
    </li>
    <li><a href="#data_standardization">Data standardization</a></li>
    <li><a href="#data_normalization">Data normalization (centering/scaling)</a></li>
    <li><a href="#binning">Binning</a></li>
    <li><a href="#indicator">Indicator variable</a></li>
    <li><a href="#cross_tables">Cross Tables</a></li>
    <li><a href="#basic_sorting">Basics of sorting</a></li>
    <li><a href="#basic_grouping">Basics of grouping</a></li>
</ul>

</div>

<hr>


<h2>What is the purpose of data wrangling?</h2>


Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.


<h3>Import data</h3>
<p>
You can find the dataset "Imports and Exports of India" at the following link: <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0EYQEN/India_trades.csv">https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0EYQEN/India_trades.csv</a>. 
We will be using this dataset throughout this course.
</p>


<h4>Import libraries</h4> 


In [ ]:
import pandas as pd
import matplotlib.pylab as plt
import numpy as np

<h2>Reading the dataset from the URL and adding the related headers</h2>


First, we assign the URL of the dataset to "path".


In [ ]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0EYQEN/India_trades.csv'

Use the Pandas method <b>read_csv()</b> to load the data from the web address.


In [ ]:
df = pd.read_csv(path)

Use the method <b>head()</b> to display the first five rows of the dataframe.


In [ ]:
# To see what the data set looks like, we'll use the head() method.
df.head()

As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis.

<div>So, how do we identify all those missing values and deal with them?</div> 

<b>How to work with missing data?</b>

Steps for working with missing data:

<ol>
    <li>Identify missing data</li>
    <li>Deal with missing data</li>
    <li>Correct data format</li>
</ol>


<h2 id="identify_handle_missing_values">Identify and handle missing values</h2>

<h3 id="identify_missing_values">Identify missing values</h3>



<h3 id="correct_data_format">Correct data format</h3>

<p>During data cleaning, it is important to check and make sure that all data is in the correct format (int, float, text or other).</p>

In Pandas, we use:

<p><b>.dtypes</b> to check the data type of each column</p>
<p><b>.astype()</b> to change the data type</p>


<h4>Let's list the data types for each column</h4>


In [ ]:
df.dtypes

<p>As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'Export', 'Import', 'Total Trade' and 'Trade Balance' are numerical values that describe India's exports, imports, total trade and trade balance in millions of US dollars, so we should expect them to be of type 'float' or 'int'; however, they are shown as type 'object'.</p> 


Let's take a look at our dataset and see why the data types may have been inferred incorrectly.


In [ ]:
df[70:85]

<p>As we can see, in some values the thousands are separated by a comma, for example "1,087.24". That is why the data types of 'Export', 'Import', 'Total Trade' and 'Trade Balance' were inferred incorrectly. We have to remove the commas and convert each column to a numeric type using the "pd.to_numeric()" function.</p> 


In [ ]:
for col in ["Export", "Import", "Total Trade"]:
    df[col] = pd.to_numeric(df[col].str.replace(',', ''))
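
As a minimal sketch of why the comma matters, the toy values below (made up for illustration, not taken from the dataset) are converted once the separator is removed. Passing `thousands=","` to `read_csv()` is an alternative that parses such numbers while loading. The names `raw`, `cleaned`, `csv_text` and `parsed` are hypothetical.

```python
import io
import pandas as pd

# Toy values for illustration only (not from India_trades.csv)
raw = pd.Series(["1,087.24", "12.50", "3,000"])

# Remove the thousands separator, then convert the strings to numbers
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False))
print(cleaned.dtype)

# Alternative: let read_csv() handle the separator while loading
csv_text = 'Country,Export\nA,"1,087.24"\nB,12.50\n'
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(parsed["Export"].dtype)
```

Either route should give numeric columns; the loop above is kept because the dataframe has already been loaded.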

<h4>Convert data types to proper format</h4>


In [ ]:
df['Country'] = df['Country'].astype('category')

In [ ]:
df.dtypes

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #1: </b>

<b>Change datatype of Trade Balance, using <code>pd.to_numeric()</code>, replace all ',' with ''.</b>

<b><h2>Attention!</h2></b>
<b>Don't skip this task; otherwise, you won't be able to finish this lab.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
df["Trade Balance"] = pd.to_numeric(df["Trade Balance"].str.replace(',', ''))

<details><summary>Click here for the solution</summary>

```python
df["Trade Balance"] = pd.to_numeric(df["Trade Balance"].str.replace(',', ''))
```

</details>


<h4>Evaluating for Missing Data</h4>

When pandas reads the file, empty cells are converted to NaN by default. We use the following functions to identify these missing values. There are two methods to detect missing data:

<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
The output is a boolean value indicating whether the value that is passed into the argument is in fact missing data.


In [ ]:
missing_data = df.isnull()
missing_data.head(5)

"True" means the value is a missing value while "False" means the value is not a missing value.
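
The following sketch shows both methods on a small made-up dataframe (the name `toy` and its values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy dataframe with a few missing values (illustration only)
toy = pd.DataFrame({"Export": [10.0, np.nan, 30.0],
                    "Import": [np.nan, np.nan, 5.0]})

print(toy.isnull())   # True marks a missing value
print(toy.notnull())  # the element-wise opposite of isnull()
```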


<h4>Count missing values in each column</h4>
<p>
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset.  In the body of the for loop the method ".value_counts()" counts the number of "True" values. Let's create a function that will count missing data and call it:
</p>


In [ ]:
def missing_data():
    # Boolean dataframe: True where a value is missing
    missing = df.isnull()
    for column in missing.columns:
        print(column)
        print(missing[column].value_counts())
        print("")

In [ ]:
missing_data()

Based on the summary above, each column has 5994 rows of data, and four of the columns contain missing data:

<ol>
    <li>"Export": 8 missing data</li>
    <li>"Import": 552 missing data</li>
    <li>"Total Trade": 585 missing data</li>
    <li>"Trade Balance" : 586 missing data</li>
</ol>
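
The same per-column counts can also be obtained without a loop. The sketch below uses a made-up dataframe (`toy` is a hypothetical name) and relies on the fact that summing booleans counts the `True` values.

```python
import numpy as np
import pandas as pd

# Toy dataframe (illustration only)
toy = pd.DataFrame({"Export": [10.0, np.nan, 30.0, 40.0],
                    "Import": [np.nan, np.nan, 5.0, 6.0]})

# Number of missing entries per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Share of missing entries per column, in percent
missing_share = toy.isnull().mean() * 100
print(missing_share)
```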


<h3 id="deal_missing_values">Deal with missing data</h3>
<b>How to deal with missing data?</b>

<ol>
    <li>Drop data<br>
        a. Drop the whole row<br>
        b. Drop the whole column
    </li>
    <li>Replace data<br>
        a. Replace it by mean<br>
        b. Replace it by frequency<br>
        c. Replace it based on other functions
    </li>
</ol>


Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.
We have some freedom in choosing how to replace the data; however, some methods are more reasonable than others. Let's take a look at each method:

<b>Replace by mean:</b>

<ul>
       <li>"Export": 8 missing data, replace them with mean</li>
    <li>"Import": 552 missing data, replace them with mean</li>
    <li>"Total Trade": 585 missing data, replace them with mean</li>
    <li>"Trade Balance" : 586 missing data, replace them with mean</li>
   
</ul>
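In our case, this method would be misleading: trade volumes differ greatly between partner countries and years, so a single column mean is a poor estimate for any individual record.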

<br><b>Replace using interpolation:</b><br>
<ul>
    <li>We can replace missing values of "Import" and "Export" using the pandas method <code>interpolate()</code>.</li>
</ul>

<br><b>Replace by other functions:</b><br>

<ul>
    <li>We can replace missing values of "Total Trade" and "Trade Balance" using these formulas:
        <ul>
            <li>"Total Trade" = "Export" + "Import"</li>
            <li>"Trade Balance" = "Export" - "Import"</li>
        </ul>
    </li>
</ul>
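
As a minimal sketch of the formula-based replacement, the made-up dataframe below (`toy` is a hypothetical name) fills a missing "Total Trade" value from the other two columns.

```python
import numpy as np
import pandas as pd

# Toy dataframe (illustration only): the second "Total Trade" value is missing
toy = pd.DataFrame({"Export": [10.0, 20.0],
                    "Import": [4.0, 5.0],
                    "Total Trade": [14.0, np.nan]})

# fillna() aligns on the index, so only the NaN in the second row is replaced
toy["Total Trade"] = toy["Total Trade"].fillna(toy["Export"] + toy["Import"])
print(toy)
```

Values that were already present are left untouched, so existing records are not overwritten.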


<h4>Replace using interpolation</h4>
<p>
Interpolation is a technique in mathematics and computer science that estimates values for points lying between known data points.
</p>
<p>
We will use linear interpolation. This method connects two known data points with a straight line and uses the line to estimate the intermediate values.
Let's use the pandas method <code>interpolate()</code> to fill in the missing values.
</p>
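
For two known points (x0, y0) and (x1, y1), linear interpolation estimates y = y0 + (y1 - y0) * (x - x0) / (x1 - x0). The sketch below applies it to a made-up series; `s` and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy series with a gap of two values between 10 and 40 (illustration only)
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# method="linear" is the default: equally spaced steps between the known values
filled = s.interpolate()
print(filled.tolist())  # [10.0, 20.0, 30.0, 40.0]
```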


In [ ]:
for country in df.Country.unique():
    for col in ["Export", "Import"]:
        df.loc[df.Country == country, col] = df[df.Country == country][col].interpolate()

In [ ]:
df[75:95]

As we can see, the missing values in the columns "Export" and "Import" are now filled with values.


Now, let's deal with missing values in "Total Trade".


In [ ]:
# Assign the result back instead of using inplace=True on a column selection
df["Total Trade"] = df["Total Trade"].fillna(df["Export"] + df["Import"])
df[75:95]

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #2: </b>

<b>Deal with missing values in "Trade Balance", using <code>fillna()</code> and the formula: "Trade Balance" = "Export" - "Import"</b>

<b><h2>Attention!</h2></b>
<b>Don't skip this task; otherwise, you won't be able to finish this lab.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
df["Trade Balance"] = df["Trade Balance"].fillna(df["Export"] - df["Import"])

<details><summary>Click here for the solution</summary>

```python
df["Trade Balance"] = df["Trade Balance"].fillna(df["Export"] - df["Import"])
```

</details>


Now, we can try again to count missing values in each column.


In [ ]:
missing_data()

As we can see, there are still 229 NaN values in the columns "Import", "Total Trade" and "Trade Balance".
Let's find out where these values are:


In [ ]:
df[df["Import"].isnull()]

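As we can see, some missing values were not interpolated. A likely reason is that, with its default settings, <code>interpolate()</code> only fills gaps that come after a known value, so missing values at the start of a country's records (or in a country with no known values at all) have nothing to interpolate from and stay empty.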
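
One case in which `interpolate()` leaves NaN values in place can be reproduced on made-up data. In the sketch below (`toy` is a hypothetical dataframe), country "A" starts with a missing value and country "B" has no known values at all.

```python
import numpy as np
import pandas as pd

# Toy data (illustration only)
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B"],
    "Import": [np.nan, 2.0, np.nan, 6.0, np.nan, np.nan, np.nan],
})

# Interpolate within each country separately
toy["Import"] = toy.groupby("Country")["Import"].transform(lambda s: s.interpolate())
print(toy)

# The gap between 2.0 and 6.0 is filled; the leading NaN of "A" and all of "B" remain
print(toy["Import"].isnull().sum())
```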


Let's drop rows with NaN values:


In [ ]:
df.dropna(inplace=True)

Now, let's try again to count missing values in each column.


In [ ]:
missing_data()

<b>Wonderful!</b>

Now we have finally obtained a cleaned dataset with no missing values and with all data in its proper format.


<h2 id="data_standardization">Data Standardization</h2>
<p>
Data is usually collected from different agencies in different formats.
(Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)
</p>
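
The statistical meaning mentioned in parentheses (subtract the mean, divide by the standard deviation) can be sketched on made-up numbers as follows; `values` and `z` are hypothetical names.

```python
import pandas as pd

# Toy values (illustration only)
values = pd.Series([10.0, 20.0, 30.0, 40.0])

# z-score: subtract the mean and divide by the (sample) standard deviation
z = (values - values.mean()) / values.std()

# After the transformation the mean is about 0 and the standard deviation about 1
print(z.mean(), z.std())
```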

<b>What is standardization?</b>

<p>Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons.
</p>

<b>Example</b>

<p>Transform US dollars (USD) into Indian rupees (INR)</p>
<p>In our dataset, the columns "Export", "Import", "Total Trade" and "Trade Balance" are expressed in millions of US dollars. Assume we need to convert "Total Trade" into Indian rupees (INR).</p>
<p>We will need to apply <b>data transformation</b> to transform US dollars (USD) into Indian rupees (INR).</p>
<p>As of 06.04.2023, the exchange rate is 81.91 rupees per 1 US dollar.</p>


In [ ]:
df.head()

<p>Let us start by obtaining the current exchange rate. We will use the <code>requests.get()</code> function to make an HTTP request to the Frankfurter API and fetch the exchange rate data from it.</p>


The returned value is in <strong>JSON format</strong>.

<pre><em><strong>JavaScript Object Notation (JSON)</strong></em> is a standard text-based format for representing structured data based on JavaScript object syntax. It is frequently employed for data transmission in online applications (e.g., sending some data from the server to the client, so it can be displayed on a web page, or vice versa).</pre>

Then we should check the HTTP status of the response. The <strong>200 (OK)</strong> code indicates that the request has succeeded. To obtain the current rate, we need to access the <em>"INR"</em> entry inside the <em>"rates"</em> field of the response.


In [ ]:
# Get the current USD to INR exchange rate
import requests

# If the API is unavailable or the response is malformed, fall back to a fixed rate
try:
    response = requests.get('https://api.frankfurter.app/latest?from=USD&to=INR', timeout=10)
    response.raise_for_status()  # raise an error unless the HTTP status indicates success
    rate = float(response.json()['rates']['INR'])
except Exception:
    rate = 81.91
print(rate)

Now, convert US dollars (USD) into Indian rupees (INR) using the current exchange rate.


In [ ]:
# Convert US dollars (USD) into Indian rupees (INR) using current exchange rate
df['Total Trade (in rupees)'] = df["Total Trade"] * rate

# check your transformed data 
df.head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #3: </b>

<b>Following the example above, convert "Trade Balance" from US dollars (USD) into Indian rupees (INR) and store the result in a new column named "Trade Balance (in rupees)".</b>
<br/>The exchange rate is 81.91 rupees per 1 US dollar.
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
# Convert US dollars (USD) into Indian rupees (INR) using current exchange rate
df['Trade Balance (in rupees)'] = df["Trade Balance"] * rate

# check your transformed data 
df.head()

<details><summary>Click here for the solution</summary>

```python
# Convert US dollars (USD) into Indian rupees (INR) using current exchange rate
df['Trade Balance (in rupees)'] = df["Trade Balance"] * 81.91

# check your transformed data 
df.head()

```

</details>


<h2 id="data_normalization">Data Normalization</h2>

<b>Why normalization?</b>

<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
</p>

<b>Example</b>

<p>To demonstrate normalization, let's say we want to scale the columns "Import" and "Export".</p>
<p><b>Target:</b> would like to normalize those variables so their value ranges from 0 to 1</p>
<p><b>Approach:</b> replace original value by (original value)/(maximum value)</p>


In [ ]:
# replace (original value) by (original value)/(maximum value)
df['Import (normalized)'] = df['Import']/df['Import'].max()
df['Export (normalized)'] = df['Export']/df['Export'].max()
df
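
Dividing by the maximum yields values between 0 and 1 only when the column has no negative values, which holds for "Import" and "Export". For a column that can be negative, such as a trade balance, min-max scaling is a common alternative. The sketch below compares both on made-up numbers; `toy`, `simple` and `min_max` are hypothetical names.

```python
import pandas as pd

# Toy column containing a negative value (illustration only)
toy = pd.DataFrame({"Trade Balance": [-50.0, 0.0, 150.0]})
col = toy["Trade Balance"]

# Simple scaling: the result can fall below 0 when negative values are present
toy["simple"] = col / col.max()

# Min-max scaling: the result always lies between 0 and 1
toy["min_max"] = (col - col.min()) / (col.max() - col.min())
print(toy)
```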

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question #4: </b>

<b>According to the example above, normalize the column "Total Trade".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
df['Total Trade (normalized)'] = df['Total Trade']/df['Total Trade'].max() 

# show the scaled columns
df[["Import (normalized)","Export (normalized)","Total Trade (normalized)"]].head()

<details><summary>Click here for the solution</summary>

```python
df['Total Trade (normalized)'] = df['Total Trade']/df['Total Trade'].max() 

# show the scaled columns
df[["Import (normalized)","Export (normalized)","Total Trade (normalized)"]].head()


```

</details>


<h2 id="binning">Binning</h2>
<b>Why binning?</b>
<p>
    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
</p>

<b>Example: </b>

<p>In our dataset, "Export" is a real-valued variable that represents the amount of India's exports to other countries. What if we need to determine which countries receive a large amount of exports from India, which receive a moderate amount, and which receive a small amount (3 types)? Can we rearrange them into three 'bins' to simplify analysis? </p>

<p>We will use the pandas method 'cut' to segment the 'Export' column into 3 bins.</p>


<h3>Example of Binning Data In Pandas</h3>


Let's plot the histogram of export to see what the distribution of export looks like.


In [ ]:
%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
plt.pyplot.hist(df["Export"])

# set x/y labels and plot title
plt.pyplot.xlabel("export")
plt.pyplot.ylabel("count")
plt.pyplot.title("export bins")

<p>We would like 3 bins of different widths.</p>
<p>Since we want to include the minimum value of export, we set the first bin edge to min(df["Export"]).</p>
<p>Since we want to include the maximum value of export, we set the last bin edge to max(df["Export"]).</p>
<p>Since we are building 3 bins of different widths, let's say that a medium amount of export starts from 400 and a high amount of export starts from 1000 (million US dollars).</p>


We build a bin array from the minimum value to the maximum value using the thresholds chosen above. These values determine where one bin ends and the next begins.


In [ ]:
bins = [min(df["Export"]), 400, 1000, max(df["Export"])]
bins

We set group names:


In [ ]:
group_names = ['Low', 'Medium', 'High']

We apply the function "cut" to determine which bin each value of `df['Export']` belongs to.


In [ ]:
df['Export-binned'] = pd.cut(df['Export'], bins, labels=group_names, include_lowest=True )
df[['Export','Export-binned']].head(20)
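
To see how the bin edges are treated, here is a small sketch on made-up values (`exports`, `edges`, `labels` and `binned` are hypothetical names). By default each bin includes its right edge, and `include_lowest=True` additionally keeps the minimum value in the first bin.

```python
import pandas as pd

# Toy export values (illustration only)
exports = pd.Series([0.0, 400.0, 400.5, 1000.0, 5000.0])
edges = [exports.min(), 400, 1000, exports.max()]
labels = ["Low", "Medium", "High"]

# Bins: [0, 400], (400, 1000], (1000, 5000]
binned = pd.cut(exports, edges, labels=labels, include_lowest=True)
print(binned.tolist())  # ['Low', 'Low', 'Medium', 'Medium', 'High']
```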

Let's see the number of values in each bin:


In [ ]:
df['Export-binned'].value_counts()

Let's plot the distribution of each bin:


In [ ]:
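# value_counts() sorts by frequency, so reorder the counts to match group_names
bin_counts = df["Export-binned"].value_counts().reindex(group_names)
pyplot.bar(group_names, bin_counts)

# set x/y labels and plot title
pyplot.xlabel("Export")
pyplot.ylabel("count")
pyplot.title("Export bins")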

<p>
    Look at the dataframe above carefully. You will find that the last column provides the bins for "export" based on 3 categories ("Low", "Medium" and "High"). 
</p>
<p>
    We successfully narrowed down the intervals to 3!
</p>


<h3>Bins Visualization</h3>
Normally, a histogram is used to visualize the distribution of bins we created above. 


In [ ]:
# draw histogram of attribute "Export-binned" with bins = 3
plt.pyplot.hist(df["Export-binned"], bins = 3)

# set x/y labels and plot title
plt.pyplot.xlabel("Export-binned")
plt.pyplot.ylabel("count")
plt.pyplot.title("Export bins")

The plot above shows the binning result for the attribute "Export".


<h2 id="indicator">Indicator Variable (or Dummy Variable)</h2>
<b>What is an indicator variable?</b>
<p>
    An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. 
</p>

<b>Why do we use indicator variables?</b>

<p>
    We use indicator variables so we can use categorical variables for regression analysis in the later modules.
</p>
<b>Example</b>
<p>
    We see the column "Export-binned" has three unique values: "Low", "Medium" or "High". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "Export-binned" to indicator variables.
</p>

<p>
    We will use the pandas function 'get_dummies' to assign numerical values to the different categories of "Export-binned". 
</p>


In [ ]:
df.columns

Get the indicator variables and assign them to the data frame "dummy_variable\_1":


In [ ]:
dummy_variable_1 = pd.get_dummies(df["Export-binned"], prefix="Export")
dummy_variable_1.head(20)
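
The sketch below shows what `get_dummies` produces for a made-up series (`levels` and `dummies` are hypothetical names). Passing `dtype=int` is an optional choice made here to get 0/1 values, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy categories (illustration only)
levels = pd.Series(["Low", "High", "Medium", "Low"], name="Export-binned")

# One column per category; each row has exactly one 1
dummies = pd.get_dummies(levels, prefix="Export", dtype=int)
print(dummies)
```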

In [ ]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "Export-binned" from "df"
df.drop("Export-binned", axis = 1, inplace=True)

In [ ]:
df.head()

<h2 id="cross_tables">Cross Tables</h2>
<p>
Another method you can use to explore your dataset is to create a cross table.<br>

A cross table is a table used to display the relationship between two or more categorical variables. 
It is used to summarize and display the distribution of two or more variables in a table format. </p><p>
The table displays the frequency of each combination of variables, providing a way to examine patterns and relationships between the variables. 
Cross tables are commonly used in statistical analysis and research to analyze the relationship between variables.</p>
<p>
We use the <code>pandas.crosstab()</code> function to create a cross table. It accepts the following arguments:<br>
<ul>
    <h3>Required:</h3>
    <li><code>"index"</code>: the values to group by in the rows of the resulting cross-tabulation. It can be an array-like object such as a dataframe column (Series), or a list of them.</li>
    <li><code>"columns"</code>: the values to group by in the columns of the resulting cross-tabulation. It can be an array-like object such as a dataframe column (Series), or a list of them.</li>
    <h3>Optional:</h3>
    <li><code>"values"</code>: the values to aggregate in the resulting cross-tabulation, given as an array-like object such as a dataframe column. It requires <code>"aggfunc"</code> to be specified. If not specified, the function counts the frequency of occurrences.</li>
    <li><code>"aggfunc"</code>: the aggregation function to apply to <code>"values"</code>. By default, it is set to <code>None</code>, which means that the function counts the frequency of occurrences.</li>
    <li><code>"rownames"</code>: this argument is optional and allows you to specify the names for the rows in the resulting cross-tabulation.</li>
    <li><code>"colnames"</code>: this argument is optional and allows you to specify the names for the columns in the resulting cross-tabulation.</li>
    <li><code>"margins"</code>: this argument is optional and specifies whether to compute row and column totals. By default, it is set to <code>False</code>.</li>
    <li><code>"margins_name"</code>: this argument is optional and specifies the name of the row and column total labels. By default, it is set to <code>'All'</code>.</li>
    <li><code>"dropna"</code>: this argument is optional and specifies whether to exclude missing values (NaN) from the resulting cross-tabulation. By default, it is set to <code>True</code>.</li>
    <li><code>"normalize"</code>: this argument is optional and specifies whether to compute relative frequencies rather than absolute frequencies. By default, it is set to <code>False</code>.</li>

</ul>
</p>
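
As a minimal sketch of the default counting behaviour and of the `margins` and `normalize` arguments, consider the made-up dataframe below (`toy`, `counts` and `shares` are hypothetical names).

```python
import pandas as pd

# Toy data (illustration only)
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "B"],
    "Level": ["Low", "High", "Low", "Low", "High"],
})

# Without "values", crosstab counts occurrences; margins adds row and column totals
counts = pd.crosstab(index=toy["Country"], columns=toy["Level"],
                     margins=True, margins_name="Total")
print(counts)

# normalize="index" turns each row into relative frequencies that sum to 1
shares = pd.crosstab(toy["Country"], toy["Level"], normalize="index")
print(shares)
```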


<p>Let's try to create a cross table using <code>pd.crosstab()</code> with the following arguments:</p>
<ul>
    <li><code>index=df['Country']</code>: take 'Country' values as rows in our cross table.</li>
    <li><code>columns=df['Financial Year(end)']</code>: take 'Financial Year(end)' values as columns in our cross table.</li>
    <li><code>values=df['Total Trade']</code>: specify 'Total Trade' values as values, that will be stored in cells in our cross table.</li>
    <li><code>aggfunc='sum'</code>: specify the aggregation function to be used when calculating the cell values in our cross table.</li>
</ul>


In [ ]:
# Create cross table
cross_table = pd.crosstab(index=df['Country'], columns=df['Financial Year(end)'], values=df['Total Trade'], aggfunc='sum')

# Display cross table
cross_table

We can visualize our cross table, using heatmap. As an example, let's take first 10 rows of our cross table.


In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (12,5))
sns.heatmap(cross_table[0:10], linewidths=.5)

As a result, we can see the total trade with each country for every financial year covered by the dataset.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question #5: </b>
    
<p>Create a cross table: as rows take "Country", as columns take "Financial Year(start)", as values take "Trade Balance", and add margin "Total".</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute
# Create cross table
cross_table = pd.crosstab(index=df['Country'], columns=df['Financial Year(start)'], values=df['Trade Balance'], aggfunc='sum', margins=True, margins_name="Total")

# Display cross table
cross_table

<details><summary>Click here for the solution</summary>

```python

# Create cross table
cross_table = pd.crosstab(index=df['Country'], columns=df['Financial Year(start)'], values=df['Trade Balance'], aggfunc='sum', margins=True, margins_name="Total")

# Display cross table
cross_table
```
</details>


<h2 id="basic_sorting">Basics of Sorting</h2>


Data sorting is an essential process in data analysis that involves organizing data in a specific order based on one or more columns. Sorting data is critical because it enables analysts to discover patterns, relationships, and outliers in the data easily. By sorting data in ascending or descending order based on a specific column, analysts can easily identify the highest or lowest values, find duplicates, or group data based on a particular attribute. This process can be particularly useful when working with large datasets where it can be challenging to identify relevant information without first sorting it. Overall, data sorting is a fundamental process that is necessary for understanding and drawing insights from data, making it an integral part of the data analysis process.


Let's keep only the financial years that start between 1997 and 2002, and sort the data by country in ascending order and, within each country, by year in descending order.


In [ ]:
sorted_df = df[df["Financial Year(start)"].between(1997, 2002)]
sorted_df = sorted_df.sort_values(["Country", "Financial Year(start)"], ascending=[True, False])

In [ ]:
# sorted_df["Financial Year(start)"].head(20)
sorted_df.head(20)

As we can see, the DataFrame now contains only the rows whose financial year starts between 1997 and 2002. The rows are sorted by country in ascending alphabetical order and, within each country, by year in descending order.
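
Sorting by several columns with a different direction for each can be sketched on made-up data as follows (`toy` and `ordered` are hypothetical names).

```python
import pandas as pd

# Toy data (illustration only)
toy = pd.DataFrame({
    "Country": ["B", "A", "A", "B"],
    "Year": [1998, 1997, 1999, 2000],
})

# Country ascending first, then Year descending within each country
ordered = toy.sort_values(["Country", "Year"], ascending=[True, False])
print(ordered.values.tolist())
# [['A', 1999], ['A', 1997], ['B', 2000], ['B', 1998]]
```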


<h2 id="basic_grouping">Basics of Grouping</h2>


<p>The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.</p>

<p>For example, let's group by the variable "Country".</p>


In [ ]:
# Group the data by country (observed=True keeps only the countries present in the data)
grouped = df.groupby('Country', observed=True)

# Calculate the total export, import, and trade balance for each group
total_export = grouped['Export'].sum()
total_import = grouped['Import'].sum()
trade_balance = total_export - total_import

# Create a new DataFrame with the calculated totals, indexed by country
summary_df = pd.DataFrame({
    'Total Export': total_export,
    'Total Import': total_import,
    'Trade Balance': trade_balance
})

# Turn the "Country" index into a regular column
summary_df = summary_df.reset_index()


In [ ]:
summary_df
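
A per-group summary of this kind can also be written more compactly with named aggregation. The sketch below uses made-up data; `toy`, `summary` and the output column names are hypothetical.

```python
import pandas as pd

# Toy data (illustration only)
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Export": [10.0, 20.0, 5.0],
    "Import": [4.0, 6.0, 8.0],
})

# Named aggregation: new_column=(source_column, function)
summary = (toy.groupby("Country", as_index=False)
              .agg(total_export=("Export", "sum"),
                   total_import=("Import", "sum")))
summary["trade_balance"] = summary["total_export"] - summary["total_import"]
print(summary)
```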

We can also group our data by "Financial Year(start)" and "Financial Year(end)".


In [ ]:
# Group the data by financial year
grouped_by_year = df.groupby(['Financial Year(start)', 'Financial Year(end)']).agg({'Export': 'sum', 'Import': 'sum'})

# Calculate the trade balance
grouped_by_year['Trade Balance'] = grouped_by_year['Export'] - grouped_by_year['Import']

# Reset the index to get the financial year as columns
grouped_by_year = grouped_by_year.reset_index()

# Print the resulting dataframe
grouped_by_year



As we can see, the result shows the annual export, import and trade balance. This can be very useful during data analysis.


Let's use a heat map to visualize our grouped data.


In [ ]:
# Create a pivot table using the grouped_by_year DataFrame
grouped_pivot = grouped_by_year.pivot_table(index='Financial Year(start)', values=['Export', 'Import', 'Trade Balance'])

# Print the pivot table
grouped_pivot


In [ ]:
#use the grouped results
plt.figure(figsize = (12,10))
sns.heatmap(grouped_pivot, linewidths=.4)

<p>Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.</p>


<h3>Save data</h3>


Save the cleaned dataframe to a new CSV file:


In [ ]:
df.to_csv('clean_df.csv', index=False)

> Note: The CSV file cannot be viewed in the JupyterLite-based SN Labs environment. However, you can click <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Module%202/DA0101EN-2-Review-Data-Wrangling.ipynb?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2022-01-01">HERE</a> to download the lab notebook (.ipynb) to your local machine and view the CSV file once the notebook is executed.


### Thank you for completing this lab!

## Authors

<a href='https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0EYQEN2912-2023-01-01'> Yaroslav Vyklyuk</a>

<a href='https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0EYQEN2912-2023-01-01'>Olga Kavun</a>

<a href='https://author.skills.network/instructors/petro_slobodian?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0EYQEN2912-2023-01-01'>Petro Slobodian</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                            |
| ----------------- | ------- | ---------- | --------------------------------------------- |
| 2023-04-04        | 1.0     | Slobodian    | Created Laboratory work                            |



<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
