
#**Data Transformation**

In this session, we explore how to manipulate and transform datasets using Pandas. We focus on string slicing, string methods, renaming, adding, and deleting columns, and converting columns to other data types (categorical and float). The exercises are explained step by step below.

##Importing and Preparing Data

In [None]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!ls '/content/drive/MyDrive'

"Andreu's Getting started with Python and Pandas.pptx"	'Time sheet Template.gsheet'
'Colab Notebooks'					'W1 DA Sept 24 summary.gdoc'
 Datasets


In [None]:
# open the file
df = pd.read_csv('/content/drive/MyDrive/Datasets/w2d1dataset.csv')

Use .head() to view the first 5 rows and understand the structure of the dataset.

In [None]:
df.head()

  CustomerID       Date ProductID  Quantity  UnitPrice  SalesAmount CustomerRegion
0       C001  1/15/2024      P001         2         20           40          North
1       C002  1/18/2024      P002         1         25           25           East
2       C003  1/20/2024      P001         3         20           60          South
3       C001  1/25/2024      P003         1        100          100          North
4       C004  1/26/2024      P002         2         25           50           West


##Strings in Python

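Extract the year from the Date column using slicing and store it in a new column, Year.

**Method:**
Use the .str accessor to apply string operations to every value in a column.
.str[-4:] slices the last 4 characters of each string, which correspond to the year because every date in this dataset ends in a four-digit year.

A new column Year is added to the dataset. Its values are strings, not integers.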

In [None]:
df['Year'] = df['Date'].str[-4:]
print(df[['Date', 'Year']])

         Date  Year
0   1/15/2024  2024
1   1/18/2024  2024
2   1/20/2024  2024
3   1/25/2024  2024
4   1/26/2024  2024
5    2/1/2024  2024
6    2/5/2024  2024
7   2/10/2024  2024
8   2/12/2024  2024
9   2/15/2024  2024
10  2/20/2024  2024
11  2/22/2024  2024
12   3/1/2024  2024
13   3/3/2024  2024
14   3/5/2024  2024
15   3/7/2024  2024
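
Slicing works here because every date ends in a four-digit year, but the result is text. As a minimal sketch of an alternative, the date can be parsed with pd.to_datetime and the year read from the parsed value. The small `dates` frame and the `YearInt` column name are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Small stand-in for the Date column (values copied from the output above)
dates = pd.DataFrame({"Date": ["1/15/2024", "2/1/2024", "3/7/2024"]})

# Slicing keeps the year as text
dates["Year"] = dates["Date"].str[-4:]

# Parsing first gives a numeric year and raises an error on malformed dates
parsed = pd.to_datetime(dates["Date"], format="%m/%d/%Y")
dates["YearInt"] = parsed.dt.year  # hypothetical column name

print(dates)
```

The parsed version is usually the safer choice when the date format could vary, since slicing silently returns whatever the last four characters happen to be.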


Concatenate CustomerRegion and ProductID into a new column, RegionProduct.

**Method:**
Use the + operator to concatenate the two string columns element by element.
Add an underscore _ as a separator between the two values.

A new column RegionProduct is created.

In [None]:
df['RegionProduct'] = df['CustomerRegion'] + "_" + df['ProductID']
print(df[['CustomerRegion', 'ProductID', 'RegionProduct']])


   CustomerRegion ProductID RegionProduct
0           North      P001    North_P001
1            East      P002     East_P002
2           South      P001    South_P001
3           North      P003    North_P003
4            West      P002     West_P002
5            East      P001     East_P001
6           North      P004    North_P004
7           South      P002    South_P002
8            East      P003     East_P003
9           North      P004    North_P004
10           East      P001     East_P001
11          South      P002    South_P002
12          North      P003    North_P003
13           West      P004     West_P004
14          North      P001    North_P001
15          South      P002    South_P002
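
The + operator returns a missing value whenever either side is missing. The sketch below compares it with .str.cat(), which accepts a placeholder through na_rep. The `sales` frame, the missing region, and the "Unknown" placeholder are assumptions made for illustration; the dataset above has no missing values.

```python
import pandas as pd

# Hypothetical rows; the third region is missing on purpose
sales = pd.DataFrame({
    "CustomerRegion": ["North", "East", None],
    "ProductID": ["P001", "P002", "P003"],
})

# With +, a missing region makes the whole result missing
plus = sales["CustomerRegion"] + "_" + sales["ProductID"]

# .str.cat() can substitute a placeholder for missing values
cat = sales["CustomerRegion"].str.cat(sales["ProductID"], sep="_", na_rep="Unknown")

print(plus)
print(cat)
```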


##Working with Strings in Pandas

Convert the CustomerRegion column to uppercase using *.str.upper()*.

After this step, all values in the CustomerRegion column are in uppercase.

In [None]:
df['CustomerRegion'] = df['CustomerRegion'].str.upper()
print(df['CustomerRegion'])


0     NORTH
1      EAST
2     SOUTH
3     NORTH
4      WEST
5      EAST
6     NORTH
7     SOUTH
8      EAST
9     NORTH
10     EAST
11    SOUTH
12    NORTH
13     WEST
14    NORTH
15    SOUTH
Name: CustomerRegion, dtype: object


Split the RegionProduct column into two new columns: Region and Product using .str.split().

**Method:**

Use .str.split('_') to split the string at the underscore.
Set expand=True to create separate columns for each split value.

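Two new columns, Region and Product, are added. Region keeps the original capitalization because RegionProduct was built before CustomerRegion was converted to uppercase.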

In [None]:
df[['Region', 'Product']] = df['RegionProduct'].str.split('_', expand=True)
print(df[['Region', 'Product']])


   Region Product
0   North    P001
1    East    P002
2   South    P001
3   North    P003
4    West    P002
5    East    P001
6   North    P004
7   South    P002
8    East    P003
9   North    P004
10   East    P001
11  South    P002
12  North    P003
13   West    P004
14  North    P001
15  South    P002
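
Splitting on every underscore assumes each value contains exactly one. As a hedged sketch, .str.rsplit() with n=1 splits only at the last underscore, so a region name that itself contains an underscore still yields two parts. The value "South_East_P003" is a made-up example, not a row from the dataset.

```python
import pandas as pd

# "South_East_P003" is hypothetical; it has two underscores
combo = pd.DataFrame({"RegionProduct": ["North_P001", "East_P002", "South_East_P003"]})

# Split once, starting from the right, so the product code is always the last part
parts = combo["RegionProduct"].str.rsplit("_", n=1, expand=True)
combo[["Region", "Product"]] = parts

print(combo)
```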


Check if ProductID contains the substring "P00" and create a new column ContainsP00 with the result.

**Method:**
Use .str.contains() to return True if the substring is found and False otherwise.

A new column ContainsP00 is added with boolean values. Every row is True here because all product IDs shown start with P00.

In [None]:
df['ContainsP00'] = df['ProductID'].str.contains('P00')
print(df[['ProductID', 'ContainsP00']])


   ProductID  ContainsP00
0       P001         True
1       P002         True
2       P001         True
3       P003         True
4       P002         True
5       P001         True
6       P004         True
7       P002         True
8       P003         True
9       P004         True
10      P001         True
11      P002         True
12      P003         True
13      P004         True
14      P001         True
15      P002         True
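
By default .str.contains() treats the pattern as a regular expression and matches it anywhere in the string. The sketch below shows two variations that may be closer to the intent: a literal match with regex=False, and .str.startswith() for a prefix check. The `products` values are invented to show the difference, and na=False is one possible choice for how to treat missing values.

```python
import pandas as pd

# Hypothetical IDs chosen so the two checks disagree
products = pd.Series(["P001", "P010", "X-P00", None], name="ProductID")

# Literal substring match anywhere in the string; missing values count as False
contains = products.str.contains("P00", regex=False, na=False)

# Prefix match only
starts = products.str.startswith("P00", na=False)

print(pd.DataFrame({"ProductID": products, "contains": contains, "starts": starts}))
```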


##Renaming, Adding, and Deleting Columns

**Rename column**

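Rename the CustomerRegion column to Region for simplicity.

**Method:**

Use .rename(columns={old_name: new_name}).
Set inplace=True to apply the change to the DataFrame directly instead of returning a new one. The column name is then updated in df itself.

Note that the split step above already created a column named Region, so after this rename the DataFrame contains two columns with that name, as the later outputs show.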

In [None]:
df.rename(columns={'CustomerRegion': 'Region'}, inplace=True)
# To rename ProductID as well, add 'ProductID': 'product_id' to the mapping above
print(df.columns)


Index(['CustomerID', 'Date', 'product_id', 'Quantity', 'UnitPrice',
       'SalesAmount', 'Region'],
      dtype='object')
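
Pandas does not complain when a rename produces a column name that already exists. The following sketch shows one way to detect the clash and one way to avoid it, assuming the older Region column is no longer needed. The `frame` DataFrame is a small stand-in built for illustration.

```python
import pandas as pd

# Stand-in frame: "Region" already exists, as it does after the split step
frame = pd.DataFrame({
    "CustomerRegion": ["NORTH", "EAST"],
    "Region": ["North", "East"],
    "ProductID": ["P001", "P002"],
})

# Without inplace=True, rename returns a new DataFrame and leaves frame unchanged
renamed = frame.rename(columns={"CustomerRegion": "Region"})
has_dupes = renamed.columns.duplicated().any()
print(list(renamed.columns), has_dupes)

# One option: drop the older column first, then rename
clean = frame.drop(columns=["Region"]).rename(columns={"CustomerRegion": "Region"})
print(list(clean.columns))
```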


###**Adding new column**

Calculate the net sales after applying a 10% discount.

**Method:**
Subtract 10% of SalesAmount from itself and assign the result to a new column NetSales.

Result: A new column NetSales is added.

In [None]:
df['NetSales'] = df['SalesAmount'] - (df['SalesAmount'] * 0.10)
print(df[['SalesAmount', 'NetSales']])


    SalesAmount  NetSales
0            40      36.0
1            25      22.5
2            60      54.0
3           100      90.0
4            50      45.0
5            80      72.0
6           200     180.0
7            50      45.0
8           100      90.0
9           200     180.0
10           40      36.0
11           25      22.5
12          100      90.0
13          200     180.0
14           40      36.0
15           75      67.5
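
Subtracting 10% is the same as multiplying by 0.90. A minimal sketch of that form, with the rate kept in a named constant so it is easy to change; `DISCOUNT_RATE` and the `orders` frame are illustrative names, and rounding to two decimals is an assumption about how currency should be displayed.

```python
import pandas as pd

DISCOUNT_RATE = 0.10  # hypothetical constant name

# First three SalesAmount values from the output above
orders = pd.DataFrame({"SalesAmount": [40, 25, 60]})

# NetSales = SalesAmount * (1 - rate), rounded to cents
orders["NetSales"] = (orders["SalesAmount"] * (1 - DISCOUNT_RATE)).round(2)

print(orders)
```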


###**Delete a Column**

Remove the RegionProduct column to simplify the dataset.

**Method:**
Use .drop(columns=[column_name]) and set inplace=True.

Result: The RegionProduct column is deleted.


In [None]:
df.drop(columns=['RegionProduct'], inplace=True)
print(df.head())


  CustomerID       Date ProductID  Quantity  UnitPrice  SalesAmount Region  \
0       C001  1/15/2024      P001         2         20           40  NORTH   
1       C002  1/18/2024      P002         1         25           25   EAST   
2       C003  1/20/2024      P001         3         20           60  SOUTH   
3       C001  1/25/2024      P003         1        100          100  NORTH   
4       C004  1/26/2024      P002         2         25           50   WEST   

   Year Region Product  ContainsP00  NetSales  
0  2024  North    P001         True      36.0  
1  2024   East    P002         True      22.5  
2  2024  South    P001         True      54.0  
3  2024  North    P003         True      90.0  
4  2024   West    P002         True      45.0  
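
Running a drop cell twice raises a KeyError, because the column is already gone the second time. As a sketch, errors="ignore" makes the call safe to repeat, and leaving out inplace=True returns a new DataFrame instead of modifying the original. The `frame` DataFrame below is a one-row stand-in.

```python
import pandas as pd

frame = pd.DataFrame({"Region": ["NORTH"], "RegionProduct": ["North_P001"]})

# Returns a new DataFrame; frame itself still has both columns
trimmed = frame.drop(columns=["RegionProduct"])

# Dropping again would normally raise KeyError; errors="ignore" skips missing labels
again = trimmed.drop(columns=["RegionProduct"], errors="ignore")

print(list(frame.columns), list(trimmed.columns), list(again.columns))
```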


## **Convert to Categorical Data Type**


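Convert the Region column to a categorical data type. A categorical column stores each distinct value once and represents the rows as small integer codes, which can reduce memory use and speed up operations such as grouping when a column has few distinct values that repeat often.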

**Method:**

Use .astype('category').

In [None]:
df['Region'] = df['Region'].astype('category')
print(df.dtypes)


CustomerID       object
Date             object
ProductID        object
Quantity          int64
UnitPrice         int64
SalesAmount       int64
Region         category
Year             object
Region         category
Product          object
ContainsP00        bool
NetSales        float64
dtype: object
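
To see the effect on memory, the two representations can be compared with .memory_usage(deep=True). This is a rough sketch on a synthetic Series of 4,000 repeated region names; the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Synthetic data: four region names repeated 1,000 times each
regions = pd.Series(["NORTH", "EAST", "SOUTH", "WEST"] * 1000)
as_category = regions.astype("category")

# deep=True counts the string contents, not just the pointers
object_bytes = regions.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# The distinct values are stored once; each row holds an integer code
categories = list(as_category.cat.categories)
codes = as_category.cat.codes.head(4).tolist()
print(categories, codes)
```

The saving shrinks as the number of distinct values approaches the number of rows, so the conversion is mainly worthwhile for columns such as regions or product codes.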


##**Change Data Type to Float**

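Convert the Quantity column from integer to float. This is useful when the column needs to hold fractional values or missing values (NaN), which the default integer type cannot represent.

**Method:**
Use .astype(float). The column's data type changes to float64.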

In [None]:
df['Quantity'] = df['Quantity'].astype(float)
print(df.dtypes)


CustomerID      object
Date            object
product_id      object
Quantity       float64
UnitPrice        int64
SalesAmount      int64
Region          object
dtype: object
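
A short sketch of why the float type matters in practice: with the default NumPy-backed dtypes, a numeric column that contains a missing value is stored as float64, because NaN is a floating-point value. The `qty` and `with_gap` Series are illustrative.

```python
import pandas as pd

# Whole numbers are stored as integers by default
qty = pd.Series([2, 1, 3], name="Quantity")
as_float = qty.astype(float)
print(qty.dtype, as_float.dtype)

# A missing entry forces the column to float64, with NaN in the gap
with_gap = pd.Series([2, None, 3], name="Quantity")
print(with_gap.dtype, int(with_gap.isna().sum()))
```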


In [None]:
df

  CustomerID       Date product_id  Quantity  UnitPrice  SalesAmount Region
0       C001  1/15/2024       P001       2.0         20           40  North
1       C002  1/18/2024       P002       1.0         25           25   East
2       C003  1/20/2024       P001       3.0         20           60  South
3       C001  1/25/2024       P003       1.0        100          100  North
4       C004  1/26/2024       P002       2.0         25           50   West
5       C005   2/1/2024       P001       4.0         20           80   East
6       C006   2/5/2024       P004       1.0        200          200  North
7       C007  2/10/2024       P002       2.0         25           50  South
8       C002  2/12/2024       P003       1.0        100          100   East
9       C008  2/15/2024       P004       1.0        200          200  North


**Thank you!**

Keep practicing!