# Exercise

The following exercises cover basic data analysis with pandas.

In this exercise, you will work with a small fictional dataset of employee records. The dataset is built in code as a pandas DataFrame and consists of the following columns:

1. Employee_ID: Unique identifier for each employee.
2. Name: First name of the employee.
3. Department: Department the employee works in (e.g., HR, IT, Marketing, Finance).
4. Position: Job title of the employee.
5. Salary: Salary of the employee.
6. Hire_Date: Date the employee was hired.

Your task is to use pandas to perform various data analysis tasks and derive insights from the dataset.

In [2]:
!pip install pandas 

Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.7/12.7 MB 9.4 MB/s eta 0:00:00
Downloading tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Installing collected packages: tzdata, pandas
Successfully installed pandas-2.2.3 tzdata-2024.2


In [5]:
import pandas as pd

# Create a fictional dataset
data = {
    'Employee_ID': [101, 102, 103, 104, 105],
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR'],
    'Position': ['Manager', 'Developer', 'Marketing Specialist', 'Accountant', 'HR Assistant'],
    'Salary': [6000, 5000, 4500, 5500, 4000],
    'Hire_Date': ['2020-01-15', '2019-05-20', '2020-03-10', '2018-11-25', '2021-02-05']
}

# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)

### 1. Display Basic Information:
- Display the first 5 rows of the DataFrame.
- Display the basic information about the DataFrame (number of rows, columns, data types).

In [6]:
df.head()


|   | Employee_ID | Name | Department | Position | Salary | Hire_Date |
|---|---|---|---|---|---|---|
| 0 | 101 | John | HR | Manager | 6000 | 2020-01-15 |
| 1 | 102 | Alice | IT | Developer | 5000 | 2019-05-20 |
| 2 | 103 | Bob | Marketing | Marketing Specialist | 4500 | 2020-03-10 |
| 3 | 104 | Emily | Finance | Accountant | 5500 | 2018-11-25 |
| 4 | 105 | David | HR | HR Assistant | 4000 | 2021-02-05 |


In [9]:
df.describe()

|   | Employee_ID | Salary |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 103.0 | 5000.0 |
| std | 1.581139 | 790.569415 |
| min | 101.0 | 4000.0 |
| 25% | 102.0 | 4500.0 |
| 50% | 103.0 | 5000.0 |
| 75% | 104.0 | 5500.0 |
| max | 105.0 | 6000.0 |


In [10]:
df.columns

Index(['Employee_ID', 'Name', 'Department', 'Position', 'Salary', 'Hire_Date'], dtype='object')

In [11]:
df.dtypes

Employee_ID     int64
Name           object
Department     object
Position       object
Salary          int64
Hire_Date      object
dtype: object
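
The task above also asks for the number of rows and columns, which `df.dtypes` alone does not report. A minimal sketch of one way to cover that, assuming the same DataFrame as above (rebuilt here so the cell runs on its own), is to combine `df.shape` with `df.info()`. Capturing the `info()` text in a buffer is only done for illustration.

```python
import io

import pandas as pd

# Same fictional employee data as in the cells above
df = pd.DataFrame({
    'Employee_ID': [101, 102, 103, 104, 105],
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR'],
    'Position': ['Manager', 'Developer', 'Marketing Specialist',
                 'Accountant', 'HR Assistant'],
    'Salary': [6000, 5000, 4500, 5500, 4000],
    'Hire_Date': ['2020-01-15', '2019-05-20', '2020-03-10',
                  '2018-11-25', '2021-02-05'],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# info() lists each column with its non-null count and dtype;
# by default it prints to stdout, here it is written to a buffer instead
buf = io.StringIO()
df.info(buf=buf)
info_text = buf.getvalue()
print(info_text)
```

In a notebook, calling `df.info()` without `buf` prints the same summary directly.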

### 2. Summary Statistics:
- Calculate and display summary statistics for numerical columns (count, mean, min, max, etc.).

In [12]:
df.describe()

|   | Employee_ID | Salary |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 103.0 | 5000.0 |
| std | 1.581139 | 790.569415 |
| min | 101.0 | 4000.0 |
| 25% | 102.0 | 4500.0 |
| 50% | 103.0 | 5000.0 |
| 75% | 104.0 | 5500.0 |
| max | 105.0 | 6000.0 |
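
Note that `describe()` treats `Employee_ID` as a numeric column even though statistics on an identifier carry little meaning. As a hedged sketch, one option is to drop the identifier and request only the statistics of interest with `agg`. The variable name `salary_stats` is made up for this example.

```python
import pandas as pd

# Only the columns needed for this illustration
df = pd.DataFrame({
    'Employee_ID': [101, 102, 103, 104, 105],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR'],
    'Salary': [6000, 5000, 4500, 5500, 4000],
})

# Keep numeric columns, but exclude the identifier first
numeric = df.drop(columns='Employee_ID').select_dtypes(include='number')

# agg with a list of names returns one row per statistic
salary_stats = numeric.agg(['count', 'mean', 'min', 'max'])
print(salary_stats)
```

The result is a DataFrame indexed by statistic name, so a single value can be read with, for example, `salary_stats.loc['mean', 'Salary']`.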


### 3. Data Manipulation:
- Convert the Hire_Date column to datetime format.
- Add a new column named Years_Worked that represents the number of years each employee has worked in the company (as of the current year).

In [None]:
# Bracket access is the safer idiom for assigning to a column
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])

In [16]:
df.dtypes

Employee_ID             int64
Name                   object
Department             object
Position               object
Salary                  int64
Hire_Date      datetime64[ns]
dtype: object
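
The dates in this dataset are all clean ISO strings, so the plain `pd.to_datetime` call works. As a sketch of what could be done with messier input, the `format` and `errors` parameters make the parsing explicit. The sample values below, including the invalid one, are invented for this illustration and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical raw values: one valid date and one that cannot be parsed
raw = pd.Series(['2020-01-15', 'not a date'])

# format pins the expected layout; errors='coerce' turns failures into NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum(), "value(s) could not be parsed")
```

With the default `errors='raise'`, the same call would stop with an exception at the first invalid value, which is often preferable when bad dates should not pass silently.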

In [17]:
from datetime import datetime

today = pd.to_datetime(datetime.now().date())

In [27]:
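# Approximate whole years of service (the task's Years_Worked, stored here as 'Years');
# dividing by 365 ignores leap days
df['Years'] = (today - df['Hire_Date']).dt.days // 365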

In [28]:
df

|   | Employee_ID | Name | Department | Position | Salary | Hire_Date | Years |
|---|---|---|---|---|---|---|---|
| 0 | 101 | John | HR | Manager | 6000 | 2020-01-15 | 4 |
| 1 | 102 | Alice | IT | Developer | 5000 | 2019-05-20 | 5 |
| 2 | 103 | Bob | Marketing | Marketing Specialist | 4500 | 2020-03-10 | 4 |
| 3 | 104 | Emily | Finance | Accountant | 5500 | 2018-11-25 | 6 |
| 4 | 105 | David | HR | HR Assistant | 4000 | 2021-02-05 | 3 |
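
Dividing the day count by 365 is a reasonable approximation, but because of leap years it can be off by one around an anniversary. A possible alternative, sketched below, compares calendar fields directly. The helper `years_worked` and the fixed reference date `2024-12-01` are assumptions made so the example is reproducible; they do not appear in the original notebook.

```python
import pandas as pd

def years_worked(hire_dates: pd.Series, as_of) -> pd.Series:
    """Whole calendar years between each hire date and as_of."""
    as_of = pd.Timestamp(as_of)
    years = as_of.year - hire_dates.dt.year
    # Subtract one year where this year's anniversary has not been reached yet
    not_reached = (
        (hire_dates.dt.month > as_of.month)
        | ((hire_dates.dt.month == as_of.month)
           & (hire_dates.dt.day > as_of.day))
    )
    return years - not_reached.astype(int)

hire_dates = pd.to_datetime(pd.Series(
    ['2020-01-15', '2019-05-20', '2020-03-10', '2018-11-25', '2021-02-05']
))

# Hypothetical reference date, fixed so the result does not change over time
result = years_worked(hire_dates, '2024-12-01')
print(result.tolist())  # [4, 5, 4, 6, 3]
```

For a hire date of 2020-01-15 evaluated on 2024-01-14, this approach gives 3 years, while the 365-day division gives 4, because the 1460 elapsed days include a leap day.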


### 4. Data Filtering:
- Filter the DataFrame to include only employees who work in the 'HR' department.
- Display the filtered DataFrame.

In [29]:
df[df["Department"]=="HR"]

|   | Employee_ID | Name | Department | Position | Salary | Hire_Date | Years |
|---|---|---|---|---|---|---|---|
| 0 | 101 | John | HR | Manager | 6000 | 2020-01-15 | 4 |
| 4 | 105 | David | HR | HR Assistant | 4000 | 2021-02-05 | 3 |
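
The boolean-mask filter above is the most common form. For comparison, the sketch below shows two equivalent spellings, `.loc` and `query`, plus `isin` for matching several departments at once. The variable names are illustrative only.

```python
import pandas as pd

# Only the columns needed for this illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR'],
})

# Explicit row selection with a boolean mask
hr_loc = df.loc[df['Department'] == 'HR']

# The same condition written as a query string
hr_query = df.query("Department == 'HR'")

# isin keeps rows whose value appears in the given list
hr_or_it = df[df['Department'].isin(['HR', 'IT'])]

print(hr_loc)
print(hr_or_it)
```

All three keep the original index labels (0 and 4 for the HR rows); calling `reset_index(drop=True)` on the result renumbers them from zero if that is preferred.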
