### **What is Pandas?**
Pandas is a Python library for analyzing, cleaning, exploring, and manipulating data, including large datasets.
<br> The "pandas" name is derived from the term "panel data," an econometrics term for multidimensional structured datasets.

pandas is distributed as a Python package. To install it in a virtual environment, type in the terminal:
<br>pip install pandas

"pandas" is built on top of "NumPy" and is intended to integrate well within a scientific computing environment with many other 3rd party libraries,
which means "pandas" can be used in numerical analysis, machine learning, data visualization, and more, thus, 3rd party libraries such as NumPy, SciPy, Matplotlib, and others are commonly used with pandas.

To work with pandas we need to import the pandas module:

In [1]:
import pandas as pd
import numpy as np
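
As a small sketch of the NumPy foundation mentioned above, the values of a pandas object can be exposed as a NumPy array, and NumPy functions can be applied to it directly. The `squares` variable is only illustrative and is not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Illustrative values (an assumption, not taken from the notebook's data)
squares = pd.Series([1.0, 4.0, 9.0])

# to_numpy() returns the values as a NumPy array
squares_array = squares.to_numpy()

# NumPy universal functions accept a Series and return a Series
roots = np.sqrt(squares)

print(type(squares_array))
print(roots.tolist())
```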

### **Series:**

A Series is like a single column of a table: a one-dimensional labeled array that can hold numerical, categorical, timestamp, or other data.

Let's create a Series:

In [2]:
one_fruit = {
    "name": "orange",
    "weight_kg": 25,
    "price_per_kg": 1.5,
}

fruit_series = pd.Series(one_fruit)

print(fruit_series)

name            orange
weight_kg           25
price_per_kg       1.5
dtype: object


In [3]:
# Using the index parameter:
one_fruit = {
    "name": "orange",
    "weight_kg": 25,
    "price_per_kg": 1.5,
}

fruit_series = pd.Series(one_fruit, index=["name", "weight_kg", "price_per_kg"])

print(fruit_series)

name            orange
weight_kg           25
price_per_kg       1.5
dtype: object


In [4]:
# If the names in the index parameter are different from the dictionary keys,
# the values will be NaN:
one_fruit = {
    "name": "orange",
    "weight_kg": 25,
    "price_per_kg": 1.5,
}

fruit_series = pd.Series(one_fruit, index=["name_fruit", "weight", "price"])

print(fruit_series)

name_fruit    NaN
weight        NaN
price         NaN
dtype: object


In [5]:
# Using the name parameter to give the series a name:
one_fruit = {
    "name": "orange",
    "weight_kg": 25,
    "price_per_kg": 1.5,
}

fruit_series = pd.Series(one_fruit, name="fruit")

print(fruit_series)

name            orange
weight_kg           25
price_per_kg       1.5
Name: fruit, dtype: object


As shown above, the Series dtype is object because the Series holds a mix of strings, integers, and floats.
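
To see that the mixed values are what cause the object dtype, the following sketch builds a second Series from the numeric entries only. The variable names `mixed` and `numeric` are illustrative.

```python
import pandas as pd

# Strings and numbers together: pandas falls back to the generic object dtype
mixed = pd.Series({"name": "orange", "weight_kg": 25, "price_per_kg": 1.5})

# Only numbers: the integer is upcast so that all values fit one float dtype
numeric = pd.Series({"weight_kg": 25, "price_per_kg": 1.5})

print(mixed.dtype)
print(numeric.dtype)
```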

In [6]:
# Using a list to create a series:
num_list = [1, 2, 3]

numbers_list = pd.Series(num_list, index=["a", "b", "c"])

print(numbers_list)

a    1
b    2
c    3
dtype: int64


In [7]:
# Using a 1-D array to create a series:
num_array = np.array([4.4, 5.2, 6.2])

numbers_array = pd.Series(num_array, index=["a", "b", "c"])

print(numbers_array)

a    4.4
b    5.2
c    6.2
dtype: float64


In [8]:
# If the Series items are strings, the dtype is object:
names_list = ["Tom", "Mary", "John"]

names_series = pd.Series(names_list, index=["a", "b", "c"])

print(names_series)

a     Tom
b    Mary
c    John
dtype: object
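
The cells above always pass the index parameter. As a brief sketch of what it changes, the example below compares a Series created without it to one created with it, and looks up a value by its label. The variable names are illustrative.

```python
import pandas as pd

# Without the index parameter, pandas assigns the labels 0, 1, 2, ...
default_index = pd.Series(["Tom", "Mary", "John"])

# With the index parameter, each label can be used to look up a value
labeled = pd.Series(["Tom", "Mary", "John"], index=["a", "b", "c"])

print(default_index.index.tolist())
print(labeled["b"])
```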


### **DataFrame:**

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure, which means each column of the table can hold a different kind of data, such as numerical, categorical, or timestamp values.
<br>It is similar to an Excel sheet or an SQL table.

Let's create a DataFrame:

In [9]:
import pandas as pd
import numpy as np

In [10]:
# We will use the DataFrame class
data_fruits = {
    "name": ["orange", "apple", "banana", "pear", "cherry"],
    "weight_kg": [25, 34, 20, 37, 30],
    "price_per_kg": [1.5, 2.3, 1.24, 2.3, 2.6]
}
# As shown, the data is heterogeneous: strings, integers, and floats

In [11]:
df_fruits = pd.DataFrame(data=data_fruits)

print(df_fruits)

     name  weight_kg  price_per_kg
0  orange         25          1.50
1   apple         34          2.30
2  banana         20          1.24
3    pear         37          2.30
4  cherry         30          2.60
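
"Size-mutable" means that columns (and rows) can be added or removed after the DataFrame is created. A minimal sketch, using a shortened, made-up version of the fruit table; `fruits_demo` and the `total_price` column are hypothetical names:

```python
import pandas as pd

# Shortened illustrative table (values are assumptions, not the notebook's data)
fruits_demo = pd.DataFrame({
    "name": ["orange", "apple"],
    "weight_kg": [25, 34],
    "price_per_kg": [1.5, 2.5],
})

# Assigning to a new column name adds that column to the existing DataFrame
fruits_demo["total_price"] = fruits_demo["weight_kg"] * fruits_demo["price_per_kg"]

print(fruits_demo)
```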


In [12]:
# Use the "dtypes" property:
print(df_fruits.dtypes)

name             object
weight_kg         int64
price_per_kg    float64
dtype: object


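As shown above, the "name" data type is "object", which pandas uses by default for strings and for columns that mix types. A boolean column is stored as "bool" and becomes "object" only when it contains missing values such as None.
<br> The "dtype: object" on the last line describes the Series returned by "dtypes" itself, because it holds a different data type for each column.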
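
A quick way to check how boolean columns are typed is sketched below: a column containing only True and False is stored as bool, while a None in the column makes pandas fall back to object. The `flags` DataFrame and its column names are hypothetical.

```python
import pandas as pd

flags = pd.DataFrame({
    "complete": [True, False, True],   # no missing values
    "with_gap": [True, None, False],   # contains a missing value
})

print(flags.dtypes)
```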

In [13]:
# Creating a DataFrame from a list and a Series; the index parameter specifies the row labels,
# and rows that have no matching label in the Series are filled with NaN:
data = {'integers': [33, 44, 51, 93], 'floats': pd.Series([5.9, 2.6], index=[2, 3])}

# df_data = pd.DataFrame(data, index=[1, 2, 3, 4])
df_data = pd.DataFrame(data, index=[0, 1, 2, 3])

print(df_data)

   integers  floats
0        33     NaN
1        44     NaN
2        51     5.9
3        93     2.6
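
The NaN values above come from label alignment: pandas matches each Series to the row labels and leaves a gap wherever a label is missing. The sketch below shows the same behavior with two Series whose labels only partly overlap; the names and values are illustrative.

```python
import pandas as pd

weights = pd.Series([25, 34], index=["orange", "apple"])
prices = pd.Series([2.3, 1.24], index=["apple", "banana"])

# The resulting rows are the union of both sets of labels;
# a value is NaN wherever one Series has no entry for a label
aligned = pd.DataFrame({"weight_kg": weights, "price_per_kg": prices})

print(aligned)
```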


### **Different Data Types**

In [14]:
import pandas as pd
import numpy as np

In [15]:
# Create DataFrame with different data types:
diff_dtypes_df = pd.DataFrame({
    'ints': [3, None, 4],  # Becomes float64 (None is stored as NaN) unless a nullable dtype such as Int64 is specified
    'floats': [2.5, 9.1, None],
    'bools': [True, None, False],
    'dates': pd.to_datetime(["2024-02-12", None, "2025-02-01"]), # Use the "to_datetime" function to convert the data to dates and times.
    'categories': pd.Categorical(["M", "F", "M"]), # Use the "Categorical" class to convert the data to categorical data.
    'strings': ["Tom", "Mary", None]
})

print(diff_dtypes_df.dtypes)

ints                 float64
floats               float64
bools                 object
dates         datetime64[ns]
categories          category
strings               object
dtype: object


In [16]:
print(diff_dtypes_df)

   ints  floats  bools      dates categories strings
0   3.0     2.5   True 2024-02-12          M     Tom
1   NaN     9.1   None        NaT          F    Mary
2   4.0     NaN  False 2025-02-01          M    None


In [17]:
# To change the "strings" and "bools" columns from object to the string and boolean data types,
# and the "ints" column to a nullable integer type:
diff_dtypes_df = pd.DataFrame({
    'ints': [3, None, 4],  # Becomes float64 (None is stored as NaN) unless a nullable dtype such as Int64 is specified
    'floats': [2.5, 9.1, None],
    'bools': [True, None, False],
    'dates': pd.to_datetime(["2024-02-12", None, "2025-02-01"]),
    'categories': pd.Categorical(["M", "F", "M"]),
    'strings': ["Tom", "Mary", None],
})

# Access each column by its name, as with dictionary keys:
diff_dtypes_df["ints"] = diff_dtypes_df["ints"].astype("Int32")
diff_dtypes_df["bools"] = diff_dtypes_df["bools"].astype("boolean")
diff_dtypes_df["strings"] = diff_dtypes_df["strings"].astype("string")

print(diff_dtypes_df.dtypes)
print()
print(diff_dtypes_df)

ints                   Int32
floats               float64
bools                boolean
dates         datetime64[ns]
categories          category
strings       string[python]
dtype: object

   ints  floats  bools      dates categories strings
0     3     2.5   True 2024-02-12          M     Tom
1  <NA>     9.1   <NA>        NaT          F    Mary
2     4     NaN  False 2025-02-01          M    <NA>
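
As an alternative to converting after the fact, the nullable data types can also be requested when the data is created. This is a sketch using standalone Series with illustrative names, not a replacement for the cell above:

```python
import pandas as pd

# The dtype parameter selects the nullable type up front,
# so None becomes <NA> instead of forcing float64 or object
ints = pd.Series([3, None, 4], dtype="Int64")
bools = pd.Series([True, None, False], dtype="boolean")
strings = pd.Series(["Tom", "Mary", None], dtype="string")

print(ints)
print(bools)
print(strings)
```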


### **"iloc" and "loc" Properties:**
To get specific data from a DataFrame, we use the following properties:
<br>1. ***iloc*** selects rows and columns by integer position.
<br>2. ***loc*** selects a group of rows and columns by label(s).
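
The difference is easiest to see when the row labels are integers that do not match the positions. A minimal sketch with a hypothetical `demo` DataFrame labeled 10, 20, 30:

```python
import pandas as pd

demo = pd.DataFrame({"fruit": ["orange", "apple", "banana"]}, index=[10, 20, 30])

# iloc counts positions: 0 is the first row, whatever its label is
by_position = demo.iloc[0, 0]

# loc looks up labels: 20 is the label of the second row
by_label = demo.loc[20, "fruit"]

print(by_position)
print(by_label)

# No row is labeled 0, so a label lookup for 0 fails
try:
    demo.loc[0]
except KeyError:
    print("loc[0] raises KeyError")
```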

In [18]:
# Let's take an "iloc" example.
# To get a specific row from "diff_dtypes_df", let's say the 2nd row (position 1):
diff_dtypes_df.iloc[1]

ints          <NA>
floats         9.1
bools         <NA>
dates          NaT
categories       F
strings       Mary
Name: 1, dtype: object

In [19]:
# To get a specific column from "diff_dtypes_df", let's say the "strings" column (position 5):
diff_dtypes_df.iloc[:, 5]

0     Tom
1    Mary
2    <NA>
Name: strings, dtype: string

In [20]:
# To get a specific value from "diff_dtypes_df", let's say the 3rd row (position 2) and the "categories" column (position 4):
diff_dtypes_df.iloc[2,4]

'M'

In [21]:
# You can do slicing:
diff_dtypes_df.iloc[1:3,3:5]

       dates categories
1        NaT          F
2 2025-02-01          M


The same Python slicing and indexing rules apply to "iloc" in pandas.
<br>Check the Python indexing and slicing video:
<br>https://youtu.be/e97_5Y_VMZk
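
As a quick sketch of that equivalence, the example below applies the same slices to a plain Python list and to a Series through "iloc". The names `letters` and `letters_series` are illustrative.

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e"]
letters_series = pd.Series(letters)

# The stop position is excluded, exactly as with a list
print(letters[1:3], letters_series.iloc[1:3].tolist())

# Negative positions count from the end
print(letters[-1], letters_series.iloc[-1])

# A step can be given as the third part of the slice
print(letters[::2], letters_series.iloc[::2].tolist())
```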

In [22]:
import pandas as pd
import numpy as np

In [23]:
diff_dtypes = pd.DataFrame({
    'ints': [3, None, 4],
    'floats': [2.5, 9.1, None],
    'bools': [True, None, False],
    'dates': pd.to_datetime(["2024-02-12", None, "2025-02-01"]),
    'categories': pd.Categorical(["M", "F", "M"]),
    'strings': ["Tom", "Mary", None],
}, index=["data1", "data2", "data3"])

print(diff_dtypes)

       ints  floats  bools      dates categories strings
data1   3.0     2.5   True 2024-02-12          M     Tom
data2   NaN     9.1   None        NaT          F    Mary
data3   4.0     NaN  False 2025-02-01          M    None


In [24]:
# Let's take a "loc" example.
# To get a specific row from "diff_dtypes", let's say the row labeled "data1":
diff_dtypes.loc["data1"]

ints                          3.0
floats                        2.5
bools                        True
dates         2024-02-12 00:00:00
categories                      M
strings                       Tom
Name: data1, dtype: object

In [25]:
# To get a specific column from "diff_dtypes", let's say the "strings" column:
diff_dtypes.loc[:, "strings"]

data1     Tom
data2    Mary
data3    None
Name: strings, dtype: object

In [26]:
# To get a specific value from "diff_dtypes", let's say the row labeled "data3" and the "categories" column:
diff_dtypes.loc["data3", "categories"]

'M'

In [27]:
# You can do slicing; with "loc", both the start and the end labels are included:
diff_dtypes.loc["data2": "data3", "dates": "categories"]

           dates categories
data2        NaT          F
data3 2025-02-01          M
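
One detail worth checking when slicing with "loc" is that the end label is part of the result, whereas "iloc" excludes the stop position. A minimal sketch with a hypothetical `values` DataFrame:

```python
import pandas as pd

values = pd.DataFrame({"value": [10, 20, 30]}, index=["data1", "data2", "data3"])

# Label slice: both "data1" and "data2" are returned
by_label = values.loc["data1":"data2"]

# Position slice: only position 0 is returned, position 1 is excluded
by_position = values.iloc[0:1]

print(by_label)
print(by_position)
```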


In [28]:
# If the rows' indices are numbers, as shown below:
diff_dtypes_df = pd.DataFrame({
    'ints': [3, None, 4],  # Becomes float64 (None is stored as NaN) unless a nullable dtype such as Int64 is specified
    'floats': [2.5, 9.1, None],
    'bools': [True, None, False],
    'dates': pd.to_datetime(["2024-02-12", None, "2025-02-01"]),
    'categories': pd.Categorical(["M", "F", "M"]),
    'strings': ["Tom", "Mary", None],
})

print(diff_dtypes_df)

   ints  floats  bools      dates categories strings
0   3.0     2.5   True 2024-02-12          M     Tom
1   NaN     9.1   None        NaT          F    Mary
2   4.0     NaN  False 2025-02-01          M    None


In [29]:
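# With "loc", the row labels can be numbers and the column labels strings.
# The slice 1:3 is label-based, so it selects the labels 1 and 2 (there is no label 3):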
diff_dtypes_df.loc[1: 3, "dates": "categories"]

       dates categories
1        NaT          F
2 2025-02-01          M


In [30]:
# You cannot use string labels with the "iloc" property:
diff_dtypes.iloc[1: 3, "dates": "categories"]

TypeError: cannot do positional indexing on Index with these indexers [dates] of type str
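
If the column names are known but "iloc" is still needed, one possible workaround is to translate the names into positions first with `columns.get_loc`. This is a sketch on a reduced, illustrative DataFrame; `table`, `start`, and `stop` are hypothetical names.

```python
import pandas as pd

table = pd.DataFrame({
    "ints": [3, 5, 4],
    "dates": pd.to_datetime(["2024-02-12", None, "2025-02-01"]),
    "categories": pd.Categorical(["M", "F", "M"]),
    "strings": ["Tom", "Mary", "Ann"],
}, index=["data1", "data2", "data3"])

# get_loc returns the integer position of a column label
start = table.columns.get_loc("dates")
# iloc excludes the stop position, so add 1 to include "categories"
stop = table.columns.get_loc("categories") + 1

subset = table.iloc[1:3, start:stop]
print(subset)
```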

To copy a DataFrame, we use the "copy" method:

In [None]:
diff_dtypes_df_copy = diff_dtypes.copy()

In [None]:
diff_dtypes_df_copy
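
The reason to call "copy" rather than assign the DataFrame to a second name is that plain assignment only creates another reference to the same object. A small sketch with hypothetical names:

```python
import pandas as pd

original = pd.DataFrame({"weight_kg": [25, 34]})

alias = original               # a second name for the same DataFrame
independent = original.copy()  # a separate DataFrame with its own data

# Changing the copy leaves the original untouched
independent.loc[0, "weight_kg"] = 99

print(alias is original)
print(original.loc[0, "weight_kg"])
print(independent.loc[0, "weight_kg"])
```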