# Applied Data Lab

# Assignment 01:

## Pandas

Pandas is an open-source software library written for the Python programming language for data manipulation and analysis. It provides data structures and operations for efficiently manipulating numerical tables and time series data.


You can import Python libraries in two ways:

1. Import Pandas with an Alias (Recommended):
   ```python
   import pandas as pd
   ```
   Using an alias like `pd` allows you to access functions within the Pandas library using `pd.function()` notation. This is the recommended approach as it simplifies and shortens your code.

2. Import Pandas Without an Alias:
   ```python
   import pandas
   ```
   Without an alias, you would need to use the full library name, such as `pandas.function()`, which can be less convenient and more verbose.

Using an alias like `pd` is a common convention in the Python data science community and is generally preferred for its simplicity and readability.

In [1]:
# Run this cell
import pandas

## Exercise 1:
Import the Pandas library using the alias `pd`

In [2]:
# Do Exercise in this cell
#
#
#

In [3]:
import pandas as pd

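The key to learning pandas is to understand its data structures. Pandas has two main data structures:

- Series: 1D
- DataFrame: 2D

Older versions of pandas also offered a three-dimensional `Panel`, but it was deprecated in version 0.20 and has since been removed from the library.

We will focus on the two main data structures in Pandas: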

1. **Series (1D):**
   - A Series is a one-dimensional labeled array capable of holding any data type.
   - It can be thought of as a single column of data.
   - Each element in a Series has a label or position, which makes it easy to access specific data points.
   - Series is useful for representing time series data, sensor data, or any data where you have a sequence of values.

2. **DataFrame (2D):**
   - A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types.
   - It can be thought of as a spreadsheet or a SQL table.
   - DataFrames are the primary data structure used in Pandas for data analysis.
   - They can handle a wide range of data types and are suitable for most tabular data, including CSV files, Excel spreadsheets, SQL tables, and more.
   - DataFrames provide powerful tools for data cleaning, exploration, manipulation, and analysis.
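
The two structures described above can be sketched with a few lines of code. This is only a minimal illustration: the names `revenues` and `companies` and all values are made-up sample data, not part of the assignment dataset.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (hypothetical sample data)
revenues = pd.Series([300, 200, 100], index=["a", "b", "c"], name="revenues")

# A DataFrame: two-dimensional, columns may hold different data types
companies = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "company": ["Alpha", "Beta", "Gamma"],
        "revenues": [300, 200, 100],
    }
)

print(revenues["b"])      # access a Series element by its label
print(companies.shape)    # (rows, columns) of the DataFrame

# A single column taken from a DataFrame is itself a Series
print(isinstance(companies["revenues"], pd.Series))
```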

## Setting Up the Address
In this cell, the variable `PATH` is set to the current working directory, that is, the directory where the notebook is running. This makes it easy to load the dataset file from this location.

In [4]:
# Run this cell
import os
PATH = os.getcwd() + '/'
PATH

'/content/'
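
As an optional alternative to concatenating strings with `'/'`, a path can also be built with `os.path.join`, which inserts the correct separator for the operating system. The sketch below is only an illustration; the file name `example.csv` is a hypothetical placeholder.

```python
import os

# Directory the notebook is running in
base_dir = os.getcwd()

# Build a full path; "example.csv" is a hypothetical file name
csv_path = os.path.join(base_dir, "example.csv")
print(csv_path)
```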

**ONLY FOR GOOGLE COLAB USERS**

For those who are using **Google Colab**, uncomment and run the cell below.

**Note**: You have to replace the value of the variable `YOUR_PATH_TO_DATASET_DIRECTORY` with the path to your dataset directory inside your Google Drive.



In [5]:
# from google.colab import drive
# drive.mount('/content/drive/')
# YOUR_PATH_TO_DATASET_DIRECTORY = "work/Applied_Data_Lab/phase_2"
# PATH = "/content/drive/MyDrive/"+YOUR_PATH_TO_DATASET_DIRECTORY+"/"
# PATH

In [6]:
from google.colab import drive
drive.mount('/content/drive/')
YOUR_PATH_TO_DATASET_DIRECTORY = "work/Applied_Data_Lab_Assignments/phase_2"
PATH = "/content/drive/MyDrive/"+YOUR_PATH_TO_DATASET_DIRECTORY+"/"
PATH

Mounted at /content/drive/


'/content/drive/MyDrive/work/Applied_Data_Lab_Assignments/phase_2/'

###  Importing CSV Dataset with Pandas

To import a CSV dataset using Pandas' `read_csv` function, you need to provide the file path to the CSV file. Here's how to do it:

```python
import pandas as pd

# Specify the file path to the CSV file and import the dataset
file_path = "your_dataset.csv"
data = pd.read_csv(file_path)
```

In this code, `file_path` should point to the location of your CSV file, and `data` will contain your dataset.
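
If you want to try `read_csv` without a file on disk, the following sketch may help. It assumes a small made-up CSV text (`csv_text`) and wraps it in `io.StringIO`, since `read_csv` also accepts file-like objects in place of a path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "rank,company,revenues\n1,Alpha,300\n2,Beta,200\n3,Gamma,100\n"

# read_csv accepts a file path or a file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the CSV text is used as the column names, and each following line becomes one row.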

## Exercise 2:

Import `CFM_Dataset_Modified.csv` into the variable `data`.

**Hint:** Use `PATH + 'filename.csv'` as the full path argument.


In [7]:
# Do Exercise in this cell
#
#
#

In [8]:
data = pd.read_csv(PATH+'CFM_Dataset_Modified.csv')

You can use the `head()` and `tail()` methods on your dataset to display a portion of the rows. These methods are useful for quickly inspecting the beginning or end of your data. You can also specify how many rows you want to display from the top or bottom.

- `data.head()` will display the first few rows (by default, the first 5 rows) of your dataset.
- `data.head(n)` will display the first `n` rows of your dataset.
- `data.tail()` will display the last few rows (by default, the last 5 rows) of your dataset.
- `data.tail(n)` will display the last `n` rows of your dataset.

These methods are handy for getting an initial view of your data and checking its structure.
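
The behaviour of these methods can be sketched on a small toy DataFrame. The frame `numbers` below is a hypothetical example, not the assignment dataset.

```python
import pandas as pd

# Hypothetical DataFrame with 20 rows holding the values 0..19
numbers = pd.DataFrame({"value": range(20)})

first_default = numbers.head()   # first 5 rows by default
first_three = numbers.head(3)    # first 3 rows
last_two = numbers.tail(2)       # last 2 rows

print(len(first_default))              # 5
print(first_three["value"].tolist())   # [0, 1, 2]
print(last_two["value"].tolist())      # [18, 19]
```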

## Exercise 3:

Display the first 10 rows using the `head()` method.

In [9]:
# Do Exercise in this cell
#
#
#

The `info()` method prints a concise summary of the DataFrame.

The summary includes the index range, the number of columns, the column labels, the number of non-null values in each column, the column data types, and the memory usage.
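
To see how missing values show up in this summary, consider the sketch below. It uses a hypothetical DataFrame `frame` with one missing entry; `count()` is used alongside `info()` only to make the non-null counts available as values.

```python
import pandas as pd

# Hypothetical data with one missing value in the "profits" column
frame = pd.DataFrame(
    {
        "company": ["Alpha", "Beta", "Gamma"],
        "profits": [10.5, None, 7.0],
    }
)

# Prints the summary; "profits" is reported with 2 non-null values
frame.info()

# count() returns the same non-null counts as a Series
non_null = frame.count()
print(non_null)
```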

## Exercise 4:
Call the `info()` method on the `data` DataFrame to retrieve details about all the columns/features in the dataset.
`data.info()`

In [None]:
# Do Exercise in this cell
#
#
#

The `describe()` method returns a statistical description of the data in the DataFrame.

If the DataFrame contains numerical data, the description contains the following information for each numeric column:

- count - The number of non-empty (non-null) values.
- mean - The average (mean) value.
- std - The standard deviation.
- min - The minimum value.
- 25% - The 25th percentile.
- 50% - The 50th percentile (the median).
- 75% - The 75th percentile.
- max - The maximum value.
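
The statistics listed above can be checked by hand on a tiny example. The DataFrame `scores` below is hypothetical sample data chosen so that the results are easy to verify.

```python
import pandas as pd

# Hypothetical numeric column with the values 1..5
scores = pd.DataFrame({"score": [1, 2, 3, 4, 5]})

stats = scores.describe()
print(stats)

# Individual statistics can be read from the result by row label
print(stats.loc["mean", "score"])   # 3.0
print(stats.loc["50%", "score"])    # 3.0 (the median)
```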

## Exercise 5:
Call the `describe()` method on the `data` DataFrame to retrieve statistical information about the numeric columns/features in the dataset. `data.describe()`



In [None]:
# Do Exercise in this cell
#
#
#

Just like these methods, there are also properties of a DataFrame that can be accessed. One of them is `columns`.

The `columns` property returns an `Index` object containing the column names (labels) of the DataFrame. It provides a quick way to see which columns are available in your dataset, which is especially useful when you're dealing with large or unfamiliar datasets. If you need a plain Python list, call `data.columns.tolist()`.

## Exercise 6:
To access the column names in a DataFrame, you can use the `columns` property. Here's how you can do it:
```python
column_names = data.columns
print(column_names)
```



In [None]:
# Do Exercise in this cell
#
#
#

`shape`: This property returns a tuple representing the dimensions of the DataFrame. The tuple contains two values: the number of rows and the number of columns. (rows, columns)

`dtypes`: This property returns a Series with the data types of each column in the DataFrame. It's useful for checking the data types of your columns.
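
A short sketch of both properties on a toy DataFrame follows. The frame `frame` and its values are hypothetical; your own dataset will have different dimensions and types.

```python
import pandas as pd

# Hypothetical DataFrame with an integer, a text, and a float column
frame = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "company": ["Alpha", "Beta", "Gamma"],
        "profits": [1.5, 2.5, 3.5],
    }
)

# shape is a tuple, so it can be unpacked into rows and columns
n_rows, n_cols = frame.shape
print(n_rows, n_cols)   # 3 3

# dtypes is a Series indexed by column name
column_types = frame.dtypes
print(column_types)
```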

## Exercise 7:

**Objective:** Access and use the `shape` and `dtypes` properties of a DataFrame.

**Instructions:** (similar to columns)

1. Access the `shape` property of the `data` DataFrame and store the result in a variable called `data_shape`. Print the values of the number of rows and number of columns.

2. Access the `dtypes` property of the `data` DataFrame and store the result in a variable called `data_data_types`. Print the data types of each column in the DataFrame.

In [None]:
# Do Exercise in this cell
#
#
#

## Dealing with Rows and Columns

To select columns from a Pandas DataFrame, index it with a column name or with a list of column names:


- **Single Attribute Name (e.g., `data['column_name']`)**: This returns a Series, which is a one-dimensional labeled array, representing a single column of data.

- **List of Attribute Names (e.g., `data[['column1', 'column2']]`)**: This returns a DataFrame, which is a two-dimensional labeled data structure, representing multiple columns of data.

In [18]:
# Run this cell
data["rank"] # Series

0       1
1       2
2       3
3       4
4       5
     ... 
81    496
82    497
83    498
84    499
85    500
Name: rank, Length: 86, dtype: int64

In [19]:
# Run this cell
data[["rank"]] # DataFrame

|     | rank |
|-----|------|
| 0   | 1    |
| 1   | 2    |
| 2   | 3    |
| 3   | 4    |
| 4   | 5    |
| ... | ...  |
| 81  | 496  |
| 82  | 497  |
| 83  | 498  |
| 84  | 499  |


In [20]:
# Run this cell
data[["rank", "revenues"]]

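|     | rank | revenues |
|-----|------|----------|
| 0   | 1    | 485873   |
| 1   | 2    | 315199   |
| 2   | 3    | 267518   |
| 3   | 4    | 262573   |
| 4   | 5    | 254694   |
| ... | ...  | ...      |
| 81  | 496  | 21903    |
| 82  | 497  | 21796    |
| 83  | 498  | 21741    |
| 84  | 499  | 21655    |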


## Exercise 8: Selecting Columns

**Objective:** Extract specific columns from the dataset as a DataFrame.

**Instructions:**

1. Use the dataset `data`.

2. Extract the following columns as a DataFrame:
   - 'company'
   - 'revenues'
   - 'profits'
   - 'country'

3. Store the extracted columns in a new DataFrame (a variable of any name).

4. Display the new DataFrame using the `print` or `display` function.

In [None]:
# Do Exercise in this cell
#
#
#

In Pandas, you can select rows from a DataFrame using either `.loc[]` or `.iloc[]`. `.loc[]` selects rows by their row label, while `.iloc[]` selects rows by their integer position, starting at 0.

**Example Using `.iloc[]`:**

```python
# Select a single row by integer index
row = data.iloc[0]
print(row)
```
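
The difference between labels and positions is easiest to see when the row labels are not integers. The sketch below assumes a hypothetical DataFrame `frame` whose index holds company names.

```python
import pandas as pd

# Hypothetical DataFrame whose row labels are strings
frame = pd.DataFrame(
    {"revenues": [300, 200, 100]},
    index=["alpha", "beta", "gamma"],
)

by_position = frame.iloc[0]     # first row, selected by integer position
by_label = frame.loc["alpha"]   # same row, selected by its label

print(by_position)
print(by_label)
```

With the default index (0, 1, 2, ...), labels and positions coincide, which is why `.loc[0]` and `.iloc[0]` return the same row for a freshly loaded CSV file.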

## Exercise 9: Selecting Rows

**Objective:** Practice selecting rows from a Pandas DataFrame using `.iloc[]`.

**Instructions:**

1. Use the provided DataFrame `data`.

2. Select the first row of the DataFrame using `.iloc[]` and store it in a variable called `first_row_loc`.

3. Select the second row of the DataFrame using `.iloc[]` and store it in a variable called `second_row_iloc`.

4. Print both `first_row_loc` and `second_row_iloc` to view the selected rows.


In [33]:
# Do Exercise in this cell
#
#
#

## Selecting Rows and Columns with `.iloc[]` & `.loc[]`

The `.iloc[]` method in Pandas allows you to select specific rows and columns from a DataFrame based on integer indexes. It follows the format `iloc[first_index, second_index]`, where:

- `first_index` refers to the **row** index you want to select.
- `second_index` refers to the **column** index you want to select.

**Example:**

```python
# Select the element in the first row and second column
element = data.iloc[0, 1]
print(element)
```

In the example above, `data.iloc[0, 1]` selects the element in the first row (index 0) and the second column (index 1) of the DataFrame.

## Exercise 10: Selecting Rows and Columns with `.iloc[]`

**Objective:** Practice selecting specific rows and columns using the `.iloc[]` method.

**Instructions:**

1. Use the provided DataFrame `data`.

2. Select the element in the third row and fifth column (`industry`) using `.iloc[]` and print the result.



In [None]:
# Do Exercise in this cell
#
#
#

## Exercise 11: Selecting Rows and Columns with `.loc[]`

**Objective:** Practice selecting specific rows and columns using the `.loc[]` method.

**Instructions:**

1. Use the provided DataFrame `data`.

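2. Select the element in the third row and the `industry` column using `.loc[]` and print the result.

Hint: `data.loc[row_label, 'column_name']`. With the default integer index, row labels start at 0, so the third row has the label `2`.

With `.loc[]`, you select rows and columns by their labels (names) rather than by their integer positions.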


In [None]:
# Do Exercise in this cell
#
#
#

## Slicing with `.iloc[]`  & `.loc[]`

You can also use slicing with `.iloc[]` to select a range of rows or columns. For example:

- `data.iloc[0:3]` selects the first three rows.
- `data.iloc[:, 1:4]` selects columns 1 to 3 (excluding 4).

**Example:**

```python
# Select the first three rows and columns 1 to 3
subset = data.iloc[0:3, 1:4]
print(subset)
```

In this example, `data.iloc[0:3, 1:4]` selects a subset of the DataFrame containing the first three rows and columns 1 to 3.
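
Slicing also works with `.loc[]`, with one important difference: a `.loc[]` slice includes its end label, while an `.iloc[]` slice excludes its end position. The sketch below illustrates this on a hypothetical DataFrame `frame` with made-up column names.

```python
import pandas as pd

# Hypothetical DataFrame with 5 rows and the default integer index
frame = pd.DataFrame(
    {"a": range(5), "b": range(5, 10), "c": range(10, 15)}
)

# Position-based: rows 0, 1, 2 and columns at positions 0 and 1
by_position = frame.iloc[0:3, 0:2]

# Label-based: rows labeled 0 through 3 and columns "a" through "b",
# with both end labels included
by_label = frame.loc[0:3, "a":"b"]

print(by_position.shape)   # (3, 2)
print(by_label.shape)      # (4, 2)
```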



## Exercise 12: Slicing Rows and Columns with `.iloc[]`

**Objective:** Practice selecting specific rows and columns using the `.iloc[]` method.

**Instructions:**

1. Use the provided DataFrame `data`.


2. Select and print a subset of the DataFrame containing the first five rows and columns 2 to 4 using slicing with `.iloc[]`.

By completing this exercise, you will become proficient in using `.iloc[]` to select specific elements, rows, and columns from a Pandas DataFrame.