## Pandas: DataFrame and Series

Pandas is a Python library for data manipulation, widely used for data analysis and data cleaning. It provides two primary data structures: Series and DataFrame. A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional, size-mutable, potentially heterogeneous table with labeled axes (rows and columns).
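As a minimal sketch of how the two structures relate (the `people` frame and its column names are made up for illustration), selecting a single column of a DataFrame yields a Series, and each column carries its own dtype:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

age = people["Age"]          # selecting one column returns a Series
print(type(age).__name__)    # Series
print(people.shape)          # (2, 2): two rows, two columns
print(age.shape)             # (2,): one dimension
print(people.dtypes)         # one dtype per column
```

The Series shares the row index of the DataFrame it came from.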

In [102]:
import pandas as pd

In [103]:
## Series
# A Series is a one-dimensional labeled array capable of holding any data type. It is similar to a column in a spreadsheet or a SQL table.

import pandas as pd
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
print(type(series))

0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'pandas.core.series.Series'>


In [104]:
## Create a series from a dictionary
data={'a': 1, 'b': 2, 'c': 3}
series_dict=pd.Series(data)
print(series_dict)
print(type(series_dict))

a    1
b    2
c    3
dtype: int64
<class 'pandas.core.series.Series'>
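When a dictionary is passed together with an explicit `index`, the labels are looked up among the dictionary keys: matching keys keep their values and missing ones become `NaN`. A small sketch (the label `'d'` is deliberately absent from the dictionary, and `series_subset` is just an illustrative name):

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3}

# Only the labels listed in `index` are kept, in that order.
# 'd' is not a key of `data`, so its value is NaN and the dtype becomes float64.
series_subset = pd.Series(data, index=['b', 'c', 'd'])
print(series_subset)
```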


In [105]:
## Create a series with a custom index
data = [10, 20, 30]
index = ['a', 'b', 'c']
series_custom = pd.Series(data, index=index)
print(series_custom)
print(type(series_custom))

a    10
b    20
c    30
dtype: int64
<class 'pandas.core.series.Series'>
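With string labels in place, values can be reached either by label (`.loc`) or by position (`.iloc`). The sketch below rebuilds the same Series and shows one difference that is easy to miss: label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

series_custom = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = series_custom.loc['b']         # label-based lookup
by_position = series_custom.iloc[1]       # position-based lookup
label_slice = series_custom.loc['a':'b']  # label slices include both endpoints
pos_slice = series_custom.iloc[0:1]       # position slices exclude the end

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```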


In [106]:
## DataFrame
## Create a dataframe from a dictionary
data={
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)
print(type(df))

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
<class 'pandas.core.frame.DataFrame'>


In [107]:
# Create a dataframe from a list of dictionaries
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

df = pd.DataFrame(data)
print(df)
print(type(df))

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
<class 'pandas.core.frame.DataFrame'>
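With a list of dictionaries, the records do not need identical keys. As a sketch with made-up records (the second one has no `'Age'` key), the columns are the union of all keys and the gaps are filled with `NaN`:

```python
import pandas as pd

# Hypothetical records: the second one has no 'Age' key.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob'},
]

df_missing = pd.DataFrame(records)
print(df_missing)
print(df_missing['Age'].isna().tolist())
```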


In [108]:
## read a CSV file into a dataframe and show the first 5 rows
df = pd.read_csv('SsalesData.csv')
df.head(5)

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Sub-Saharan Africa,Chad,Office Supplies,Online,L,1/27/2011,292494523,2/12/2011,4484,651.21,524.96,2920025.64,2353920.64,566105.0
1,Europe,Latvia,Beverages,Online,C,12/28/2015,361825549,1/23/2016,1075,47.45,31.79,51008.75,34174.25,16834.5
2,Middle East and North Africa,Pakistan,Vegetables,Offline,C,1/13/2011,141515767,2/1/2011,6515,154.06,90.93,1003700.9,592408.95,411291.95
3,Sub-Saharan Africa,Democratic Republic of the Congo,Household,Online,C,9/11/2012,500364005,10/6/2012,7683,668.27,502.54,5134318.41,3861014.82,1273303.59
4,Europe,Czech Republic,Beverages,Online,C,10/27/2015,127481591,12/5/2015,3491,47.45,31.79,165647.95,110978.89,54669.06


In [109]:
df.tail()

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
9995,Asia,Laos,Beverages,Online,H,7/15/2014,199342048,7/31/2014,8597,47.45,31.79,407927.65,273298.63,134629.02
9996,Europe,Liechtenstein,Cosmetics,Online,C,10/27/2012,763044106,11/1/2012,562,437.2,263.33,245706.4,147991.46,97714.94
9997,Sub-Saharan Africa,Democratic Republic of the Congo,Vegetables,Offline,M,2/14/2013,848579967,3/20/2013,2524,154.06,90.93,388847.44,229507.32,159340.12
9998,Sub-Saharan Africa,South Africa,Meat,Online,L,2/19/2017,298185956,2/22/2017,8706,421.89,364.69,3672974.34,3174991.14,497983.2
9999,Asia,Mongolia,Snacks,Offline,M,4/12/2016,824410903,4/16/2016,361,152.58,97.44,55081.38,35175.84,19905.54
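In the previews above, `Order Date` and `Ship Date` look like month/day/year strings. Assuming that pattern holds for the whole file, they could be converted to datetimes with `pd.to_datetime`. The sketch reads two of the rows shown above from an in-memory string rather than the CSV file, so the reduced column set and the `Days to Ship` column are illustrative additions:

```python
import io
import pandas as pd

csv_text = """Country,Order Date,Ship Date
Chad,1/27/2011,2/12/2011
Latvia,12/28/2015,1/23/2016
"""
sales = pd.read_csv(io.StringIO(csv_text))

# Assumes every date follows the month/day/year pattern seen in the preview.
for col in ['Order Date', 'Ship Date']:
    sales[col] = pd.to_datetime(sales[col], format='%m/%d/%Y')

# With real datetimes, date arithmetic works directly.
sales['Days to Ship'] = (sales['Ship Date'] - sales['Order Date']).dt.days
print(sales)
```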


In [110]:
### accessing data from the dataframe
df['Region']

0                 Sub-Saharan Africa
1                             Europe
2       Middle East and North Africa
3                 Sub-Saharan Africa
4                             Europe
                    ...             
9995                            Asia
9996                          Europe
9997              Sub-Saharan Africa
9998              Sub-Saharan Africa
9999                            Asia
Name: Region, Length: 10000, dtype: object
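A single column label returns a Series, as above. Passing a list of labels returns a DataFrame instead. A short sketch on a small stand-in frame (two rows copied from the preview, not the full file):

```python
import pandas as pd

# Small stand-in for the sales data, built from rows shown earlier.
sales = pd.DataFrame({
    'Region': ['Sub-Saharan Africa', 'Europe'],
    'Country': ['Chad', 'Latvia'],
    'Units Sold': [4484, 1075],
})

one_col = sales['Region']                 # single label -> Series
two_cols = sales[['Region', 'Country']]   # list of labels -> DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__, two_cols.shape)
```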

In [111]:
## select a row by its index label
df.loc[0]

Region            Sub-Saharan Africa
Country                         Chad
Item Type            Office Supplies
Sales Channel                 Online
Order Priority                     L
Order Date                 1/27/2011
Order ID                   292494523
Ship Date                  2/12/2011
Units Sold                      4484
Unit Price                    651.21
Unit Cost                     524.96
Total Revenue             2920025.64
Total Cost                2353920.64
Total Profit                566105.0
Name: 0, dtype: object

In [112]:
## row at position 2, columns from position 3 onward
df.iloc[2, 3:]

Sales Channel       Offline
Order Priority            C
Order Date        1/13/2011
Order ID          141515767
Ship Date          2/1/2011
Units Sold             6515
Unit Price           154.06
Unit Cost             90.93
Total Revenue     1003700.9
Total Cost        592408.95
Total Profit      411291.95
Name: 2, dtype: object
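On a freshly loaded frame the index labels happen to equal the row positions, so `.loc` and `.iloc` look interchangeable. They diverge once the rows are reordered. A sketch on a small stand-in frame (three rows taken from the preview):

```python
import pandas as pd

sales = pd.DataFrame({
    'Country': ['Chad', 'Latvia', 'Pakistan'],
    'Units Sold': [4484, 1075, 6515],
})

# Sorting reorders the rows but keeps the original index labels.
by_units = sales.sort_values('Units Sold')

print(by_units.loc[0, 'Country'])   # label 0 is still the Chad row
print(by_units.iloc[0, 0])          # position 0 is now the row with the fewest units
```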

In [113]:
## accessing a specified element
df.at[1,'Country']

'Latvia'

In [114]:
df.at[1, 'Order ID']

361825549
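`.at` has a position-based counterpart, `.iat`, and both can also assign a single value. A sketch on a small hypothetical frame:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

print(people.at[1, 'Age'])    # row label 1, column label 'Age'
print(people.iat[1, 1])       # row position 1, column position 1

people.at[1, 'Age'] = 31      # .at can also assign a single value
print(people.at[1, 'Age'])
```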

In [115]:
### Data manipulation with data frames
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

dff = pd.DataFrame(data)

dff['Salary'] = [50000, 60000, 70000]
print(dff)


      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   70000


In [116]:
## remove a column (drop returns a new DataFrame; the change is permanent only with inplace=True or by reassigning the result)
dff.drop('Salary', axis=1, inplace=True)
print(dff)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
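For comparison, here is a sketch of `drop` without `inplace=True` on a small hypothetical frame: the call returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000],
})

trimmed = people.drop(columns='Salary')   # returns a new DataFrame

print(list(people.columns))    # original still has 'Salary'
print(list(trimmed.columns))
```

Reassigning the result (`people = people.drop(columns='Salary')`) has the same practical effect as `inplace=True`.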


In [117]:
## add 7 to every value in the Age column
dff['Age'] = dff['Age'] + 7
print(dff)

      Name  Age         City
0    Alice   32     New York
1      Bob   37  Los Angeles
2  Charlie   42      Chicago
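The same element-wise arithmetic works between two columns. As a sketch, using three revenue and cost pairs copied from the `head()` output earlier, a profit column can be derived by subtraction. The column name `Profit` and the rounding to two decimals are choices made here, the latter to keep floating-point noise out of the printed result:

```python
import pandas as pd

sales = pd.DataFrame({
    'Total Revenue': [2920025.64, 51008.75, 1003700.90],
    'Total Cost': [2353920.64, 34174.25, 592408.95],
})

# Subtraction is applied row by row, with no explicit loop.
sales['Profit'] = (sales['Total Revenue'] - sales['Total Cost']).round(2)
print(sales)
```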


In [118]:
## remove the row with index label 0
dff.drop(0, inplace=True)
print(dff)

      Name  Age         City
1      Bob   37  Los Angeles
2  Charlie   42      Chicago
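After a row is dropped, the remaining rows keep their original labels, so the index now starts at 1. If a fresh 0-based index is wanted, `reset_index(drop=True)` is one way to get it. A sketch on a rebuilt copy of the frame:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [32, 37, 42],
})

remaining = people.drop(0)                      # index labels are now 1 and 2
renumbered = remaining.reset_index(drop=True)   # labels restart at 0

print(list(remaining.index))
print(list(renumbered.index))
```

Without `drop=True`, the old labels would be kept as an extra column named `index`.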


In [None]:
# Display the data types of each column

print("Data types:\n", df.dtypes)

# Describe the DataFrame

print("Statistical summary:\n", df.describe())

# Group by a column and perform an aggregation
# (Order ID is an identifier, so its mean only illustrates the syntax)
grouped = df.groupby('Country')['Order ID'].mean()
print("Mean Order ID by country:\n", grouped)

Data types:
 Region             object
Country            object
Item Type          object
Sales Channel      object
Order Priority     object
Order Date         object
Order ID            int64
Ship Date          object
Units Sold          int64
Unit Price        float64
Unit Cost         float64
Total Revenue     float64
Total Cost        float64
Total Profit      float64
dtype: object
Statistical summary:
            Order ID    Units Sold    Unit Price     Unit Cost  Total Revenue  \
count  1.000000e+04  10000.000000  10000.000000  10000.000000   1.000000e+04   
mean   5.498719e+08   5002.855900    268.143139    188.806639   1.333355e+06   
std    2.607835e+08   2873.246454    217.944092    176.445907   1.465026e+06   
min    1.000892e+08      2.000000      9.330000      6.920000   1.679400e+02   
25%    3.218067e+08   2530.750000    109.280000     56.670000   2.885511e+05   
50%    5.485663e+08   4962.000000    205.700000    117.110000   8.000512e+05   
75%    7.759981e+08   7472.