**Phase 1: Data Exploration and Cleaning (Essential First Steps)**

1.  **"First, I need you to understand the data. Load the CSV and give me a summary of each column. Tell me the data type, the number of missing values, and the number of unique values for each field."**
    * This involves using a Python library such as pandas to read the CSV and perform basic data inspection.
    * In practice, `df.info()` reports each column's data type and non-null count, `df.isnull().sum()` counts missing values per column, and `df.nunique()` counts distinct values per column.
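The three inspection calls can be combined into a single per-column summary table. The sketch below uses a small made-up DataFrame (the column names and values are assumptions for illustration, not the real CSV):

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
df = pd.DataFrame({
    "Name": ["Widget", "Gadget", "Widget", None],
    "Price": [9.99, None, 9.99, 4.50],
    "Stock": [10, 0, 10, 3],
})

# One row per column: data type, missing-value count, unique-value count.
# nunique() ignores missing values by default.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isnull().sum(),
    "unique": df.nunique(),
})
print(summary)
```

With a real file, replacing the toy DataFrame with `pd.read_csv(...)` should give the same kind of table.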

2.  **"Next, we need to clean the data. Handle the missing values appropriately. For numeric fields like 'Price' and 'Stock', consider imputation or removal. For categorical fields like 'Color' or 'Category', decide on a strategy (e.g., fill with 'Unknown' or remove rows)."**
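    * The right strategy depends on context, such as how much of a column is missing and how the field will be used.
    * Imputation (filling in missing values) can use the mean or median for numeric fields, or the mode for categorical fields.
    * Rows with many missing values may need to be removed instead.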
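A minimal sketch of one possible combination of these strategies follows. The data is made up, and the choices (median for 'Price', 'Unknown' for 'Color', dropping rows without 'Stock') are assumptions for illustration rather than a recommendation for every dataset:

```python
import pandas as pd

# Hypothetical toy data with gaps in every column.
df = pd.DataFrame({
    "Price": [10.0, None, 30.0, 1000.0],
    "Stock": [5, 2, None, 8],
    "Color": ["Red", None, "Blue", "Red"],
})

cleaned = df.copy()

# Numeric field: impute with the median, which is less sensitive
# to outliers (such as 1000.0 here) than the mean.
cleaned["Price"] = cleaned["Price"].fillna(cleaned["Price"].median())

# Categorical field: keep the row and mark the value as unknown.
cleaned["Color"] = cleaned["Color"].fillna("Unknown")

# Field where guessing is risky: drop rows that lack a value.
cleaned = cleaned.dropna(subset=["Stock"])

print(cleaned)
```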

3.  **"Check for inconsistencies in the 'Currency' and 'Price' fields. Ensure all prices are in a consistent currency (e.g., convert all prices to USD if necessary). Also, clean up any text fields like 'Description' or 'Name' by removing unnecessary whitespace or special characters."**
    * This step ensures data quality and consistency.
    * You might need to write helper functions to clean text.
    * You might need an external API to obtain exchange rates for currency conversion.
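As a rough sketch of this step, the example below normalizes text and converts prices to USD. The exchange rates are hard-coded, hypothetical values standing in for whatever an external rate source would return, and `clean_text`, `RATES_TO_USD`, and `Price_USD` are illustrative names:

```python
import re
import pandas as pd

# Hypothetical, hard-coded exchange rates for illustration only;
# real rates would come from an external source.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "GBP": 1.25}

def clean_text(value: str) -> str:
    """Drop unusual special characters and collapse repeated whitespace."""
    value = re.sub(r"[^\w\s.,%&'-]", "", value)
    value = re.sub(r"\s+", " ", value)
    return value.strip()

# Made-up rows with messy names and inconsistent currency codes.
df = pd.DataFrame({
    "Name": ["  Dark   Bar ", "Mint\tChip!!", "Peanut  Cubes*"],
    "Price": [10.0, 20.0, 8.0],
    "Currency": ["USD", "eur", "GBP "],
})

df["Name"] = df["Name"].map(clean_text)

# Normalize currency codes before looking up the rate.
df["Currency"] = df["Currency"].str.strip().str.upper()

# Codes missing from the table map to NaN, which makes them easy to spot.
df["Price_USD"] = (df["Price"] * df["Currency"].map(RATES_TO_USD)).round(2)

print(df)
```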

4.  **"Examine the 'Stock' and 'Availability' fields. Are they consistent? Do they provide redundant information? If so, decide which field is more reliable and consider removing the other."**
    * Sometimes multiple fields convey similar information, and keeping both can let them contradict each other.
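One way to test whether the two fields agree is to derive availability from the stock count and compare it with the stated value. The sketch below assumes 'Availability' holds the labels "In Stock" and "Out of Stock", which is a guess about the data rather than something the source confirms:

```python
import pandas as pd

# Made-up rows; the last two are deliberately inconsistent.
df = pd.DataFrame({
    "Stock": [12, 0, 5, 0],
    "Availability": ["In Stock", "Out of Stock", "Out of Stock", "In Stock"],
})

# Availability implied by the stock count.
implied = df["Stock"] > 0

# Availability as stated, normalized for case and stray whitespace.
stated = df["Availability"].str.strip().str.lower() == "in stock"

# Rows where the two fields disagree.
mismatch = df[implied != stated]
print(f"{len(mismatch)} of {len(df)} rows disagree")
print(mismatch)
```

If no rows disagree, the fields are redundant and one can likely be dropped. If some do, the mismatches help show which field is more trustworthy.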

In [13]:
import pandas as pd

In [17]:
df = pd.read_csv('Chocolate Sales.csv')
df.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,"$5,320",180
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,"$7,896",94
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,"$4,501",91
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,"$12,726",342
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,"$13,685",184


## Data Cleaning

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Sales Person   1094 non-null   object
 1   Country        1094 non-null   object
 2   Product        1094 non-null   object
 3   Date           1094 non-null   object
 4   Amount         1094 non-null   object
 5   Boxes Shipped  1094 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 51.4+ KB


In [19]:
# Keep the text after the '$' sign (values are still strings at this point)
df['Integer Amount'] = df['Amount'].str.split('$').str[1]
df.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped,Integer Amount
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,"$5,320",180,"5,320"
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,"$7,896",94,"7,896"
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,"$4,501",91,"4,501"
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,"$12,726",342,"12,726"
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,"$13,685",184,"13,685"


In [21]:
# Remove thousands separators and any surrounding whitespace
df['Integer Amount'] = df['Integer Amount'].str.replace(',', '').str.strip()
df.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped,Integer Amount
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,"$5,320",180,5320
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,"$7,896",94,7896
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,"$4,501",91,4501
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,"$12,726",342,12726
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,"$13,685",184,13685


In [22]:
# Convert the cleaned strings to integers
df['Integer Amount'] = df['Integer Amount'].astype(int)
df.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped,Integer Amount
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,"$5,320",180,5320
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,"$7,896",94,7896
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,"$4,501",91,4501
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,"$12,726",342,12726
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,"$13,685",184,13685


In [23]:
# Overwrite the original string column with the numeric values
df['Amount'] = df['Integer Amount']
df.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped,Integer Amount
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,5320,180,5320
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,7896,94,7896
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,4501,91,4501
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,12726,342,12726
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,13685,184,13685


In [24]:
df.rename(columns={'Amount': 'Amount($)'}, inplace=True)
df.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount($),Boxes Shipped,Integer Amount
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,5320,180,5320
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,7896,94,7896
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,4501,91,4501
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,12726,342,12726
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,13685,184,13685


In [25]:
df.drop(columns='Integer Amount', inplace=True)

In [26]:
df.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount($),Boxes Shipped
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,5320,180
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,7896,94
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,4501,91
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,12726,342
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,13685,184
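The cleaning cells above could also be condensed into one chained expression that avoids the temporary 'Integer Amount' column. This is a sketch on a few sample values from the table, not a replacement for the step-by-step cells. The trailing space in the last value is an assumption, added to show why `.str.strip()` is included:

```python
import pandas as pd

# Sample amounts in the original string format.
df = pd.DataFrame({"Amount": ["$5,320", "$7,896", "$12,726 "]})

# Strip '$' and ',' in one pass, trim whitespace, then convert to int.
df["Amount($)"] = (
    df["Amount"]
    .str.replace(r"[$,]", "", regex=True)
    .str.strip()
    .astype(int)
)

# The original string column is no longer needed.
df = df.drop(columns="Amount")

print(df)
```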


In [None]:
df.to_csv('Cleaned Chocolate Sales.csv',index=False)