# Introduction to pandas

## What is pandas?

Pandas is a popular open-source Python library for data manipulation and analysis. It provides powerful data structures and data analysis tools built on top of NumPy, a library for scientific computing in Python. Pandas was created by Wes McKinney in 2008 and is widely used in various fields, including finance, economics, statistics, and data science.

The name "pandas" is derived from "panel data," an econometrics term for multidimensional datasets that track the same subjects over multiple time periods. However, the library is not limited to panel data and can handle a wide range of data structures and formats.

## Why use pandas?

Pandas provides several advantages for working with structured (tabular, multidimensional, potentially heterogeneous) and time series data:

1. **Easy to use**: Pandas has a clean and intuitive syntax for data manipulation and analysis, making it accessible to both novice and experienced users.

2. **High-performance**: Pandas is built on top of NumPy, which is highly optimized for numerical operations. Vectorized, column-wise operations therefore run efficiently, even on large datasets that fit in memory.

3. **Powerful data structures**: Pandas provides two main data structures: Series (1D labeled homogeneous array) and DataFrame (2D labeled data structure with columns of potentially different types). These data structures make it easy to work with structured data.

4. **Data cleaning and preprocessing**: Pandas offers a wide range of functions and methods for data cleaning, preprocessing, merging, reshaping, and handling missing data.

5. **Data analysis and visualization**: Pandas integrates well with other libraries like Matplotlib, Seaborn, and Plotly for data visualization. It also supports advanced statistical and analytical operations.

6. **Interoperability**: Pandas can read and write data in various formats, such as CSV, Excel, SQL databases, and more, making it easy to work with data from different sources.

7. **Time series analysis**: Pandas has strong support for working with time series data, including handling dates and time zones, resampling, and time-based operations.

Together, these features make pandas a versatile library for data manipulation, analysis, and exploration, and a go-to choice for data scientists, analysts, and researchers.

## Importing pandas and other necessary libraries

In [1]:
import pandas as pd
import numpy as np

In this notebook, we import the pandas library and assign it the conventional alias `pd`. We also import the NumPy library, which is commonly used alongside pandas for numerical operations.

# pandas Data Structures

Pandas provides two main data structures: Series and DataFrame. These data structures are designed to handle structured data efficiently and provide powerful tools for data manipulation and analysis.

## Series (1D labeled homogeneous array)

A Series is a one-dimensional labeled array. All of its values share a single data type (dtype), which can be any type pandas supports, including the generic `object` dtype for arbitrary Python objects. It is similar to a column in a spreadsheet or a SQL table, where each entry has a corresponding label (index). A Series can be created from a list, a NumPy array, a dictionary, or a scalar value.

In [2]:
# Creating a Series from a list
data = [1, 2, 3, 4]
series = pd.Series(data)
print(series)

0    1
1    2
2    3
3    4
dtype: int64


In this example, we create a Series from a list of integers. The Series has default integer indices (0, 1, 2, 3) assigned automatically. You can access individual elements of a Series by label with `.loc` or by integer position with `.iloc`.
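As a brief illustration of element access, the sketch below uses `.loc` for label-based lookup and `.iloc` for position-based lookup. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Series with the default integer index (labels 0..3)
numbers = pd.Series([1, 2, 3, 4])

# Series with string labels taken from the dictionary keys
labeled = pd.Series({'a': 10, 'b': 20, 'c': 30})

first_value = numbers.loc[0]      # label-based: the element labeled 0
last_value = numbers.iloc[-1]     # position-based: the last element
value_b = labeled.loc['b']        # label-based: the element labeled 'b'
first_labeled = labeled.iloc[0]   # position-based: the first element

print(first_value, last_value, value_b, first_labeled)
```

With the default index, labels and positions happen to coincide, so the distinction only becomes visible once a Series has custom labels.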

Series can also be created from dictionaries, where the keys become the labels, and the values become the data:

In [3]:
# Creating a Series from a dictionary
data = {'a': 10, 'b': 20, 'c': 30}
series = pd.Series(data)
print(series)

a    10
b    20
c    30
dtype: int64


Series are useful for representing and manipulating one-dimensional data, such as a list of values or a column from a table.

## DataFrame (2D labeled data structure with columns of potentially different types)

A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or a SQL table. You can think of it as a collection of Series objects that share one index: each Series is a column, and each row is an entry identified by its index label.

In [1]:
# Creating a DataFrame from a dictionary of lists
import pandas as pd

data = {'Name': ['John', 'Emily', 'Michael', 'Sarah'],
        'Age': [25, 32, 38, 27],
        'City': ['New York', 'London', 'Paris', 'Berlin']}

df = pd.DataFrame(data)
print(df)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


In this example, we create a DataFrame from a dictionary where the keys represent the column names, and the values are lists containing the data for each column.

DataFrames can also be created from lists of lists, NumPy arrays, or other data structures. You can specify column names explicitly or let pandas automatically assign default column names (0, 1, 2, ...).

In [2]:
# Creating a DataFrame from a list of lists
data = [[1, 'a'], [2, 'b'], [3, 'c']]
df = pd.DataFrame(data, columns=['Numbers', 'Letters'])
print(df)

   Numbers Letters
0        1       a
1        2       b
2        3       c


DataFrames are powerful data structures that allow you to store and manipulate tabular data efficiently. You can access and manipulate columns and rows using various indexing techniques, perform operations on the data, and apply functions across rows or columns.

In [3]:
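# Accessing a column (Series)
print(df['Numbers'])

# Selecting multiple columns. `df` now holds the Numbers/Letters table,
# so the people table is rebuilt under its own name first.
people = pd.DataFrame({'Name': ['John', 'Emily', 'Michael', 'Sarah'],
                       'Age': [25, 32, 38, 27],
                       'City': ['New York', 'London', 'Paris', 'Berlin']})
print(people[['Age', 'City']])

0    1
1    2
2    3
Name: Numbers, dtype: int64
   Age      City
0   25  New York
1   32    London
2   38     Paris
3   27    Berlin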


DataFrames are widely used in data analysis, data cleaning, and data manipulation tasks due to their flexibility and powerful features. They provide a convenient way to work with structured data and perform various operations, such as filtering, sorting, grouping, and merging data.
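The operations named above (filtering, sorting, grouping, and merging) can be sketched in a few lines. This is a minimal example rather than a complete tour; the `countries` lookup table and all variable names are invented here for illustration.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['John', 'Emily', 'Michael', 'Sarah'],
                       'Age': [25, 32, 38, 27],
                       'City': ['New York', 'London', 'Paris', 'Berlin']})

# Hypothetical lookup table, invented for this example
countries = pd.DataFrame({'City': ['New York', 'London', 'Paris', 'Berlin'],
                          'Country': ['USA', 'UK', 'France', 'Germany']})

# Filtering: keep rows where Age is greater than 30
over_30 = people[people['Age'] > 30]

# Sorting: order rows by Age, oldest first
by_age = people.sort_values('Age', ascending=False)

# Merging: attach the Country column by matching on City
merged = people.merge(countries, on='City', how='left')

# Grouping: average Age per Country
mean_age = merged.groupby('Country')['Age'].mean()
print(mean_age)
```

Each call returns a new object and leaves `people` unchanged, which makes it easy to chain steps or inspect intermediate results.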

# Creating pandas Objects

Pandas provides several ways to create its core data structures, Series and DataFrames, from various data sources. In this section, we'll explore different methods of creating pandas objects from lists, dictionaries, scalars, and various file formats.

In [1]:
import pandas as pd

## Creating Series from lists, dictionaries, and scalars

### Creating a Series from a list

In [2]:
# Creating a Series from a list
data = [1, 2, 3, 4]
series = pd.Series(data)
print(series)

0    1
1    2
2    3
3    4
dtype: int64


In this example, we create a Series from a list of integers. The Series has default integer indices (0, 1, 2, 3) assigned automatically.

### Creating a Series from a dictionary

In [3]:
# Creating a Series from a dictionary
data = {'a': 10, 'b': 20, 'c': 30}
series = pd.Series(data)
print(series)

a    10
b    20
c    30
dtype: int64


In this case, we create a Series from a dictionary, where the keys become the labels, and the values become the data.

### Creating a Series from a scalar value

In [4]:
# Creating a Series from a scalar value
scalar_series = pd.Series(5, index=[0, 1, 2, 3])
print(scalar_series)

0    5
1    5
2    5
3    5
dtype: int64


Here, we create a Series from a scalar value (5) and specify the desired index labels ([0, 1, 2, 3]).
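The `index` parameter is not limited to scalars. The sketch below, with illustrative names and values, passes it alongside a list to set custom labels and alongside a dictionary to pick and order keys. Labels that are missing from the dictionary are filled with `NaN`.

```python
import pandas as pd

# List data with explicit string labels and a name for the Series
readings = pd.Series([1.5, 2.5, 3.5], index=['x', 'y', 'z'], name='reading')

# Dictionary data with an explicit index: keys are selected and reordered,
# and the label 'd', which has no matching key, becomes NaN
subset = pd.Series({'a': 10, 'b': 20, 'c': 30}, index=['c', 'a', 'd'])

print(readings)
print(subset)
```

Because `NaN` is a floating-point value, `subset` ends up with a `float64` dtype even though the dictionary holds integers.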

## Creating DataFrames from lists, dictionaries, and other data structures

### Creating a DataFrame from a dictionary of lists

In [5]:
# Creating a DataFrame from a dictionary of lists
data = {'Name': ['John', 'Emily', 'Michael', 'Sarah'],
        'Age': [25, 32, 38, 27],
        'City': ['New York', 'London', 'Paris', 'Berlin']}

df = pd.DataFrame(data)
print(df)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


In this example, we create a DataFrame from a dictionary where the keys represent the column names, and the values are lists containing the data for each column.

### Creating a DataFrame from a list of lists

In [6]:
# Creating a DataFrame from a list of lists
data = [[1, 'a'], [2, 'b'], [3, 'c']]
df = pd.DataFrame(data, columns=['Numbers', 'Letters'])
print(df)

   Numbers Letters
0        1       a
1        2       b
2        3       c


Here, we create a DataFrame from a list of lists, and we explicitly provide the column names using the `columns` parameter.

DataFrames can also be created from NumPy arrays, database tables, or other data structures. This flexibility allows you to work with data from various sources and formats.
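As a rough sketch of two of these other sources, the example below builds DataFrames from a NumPy array and from a list of dictionaries. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 3x2 NumPy array holding the values 0..5
array = np.arange(6).reshape(3, 2)

# Without `columns`, pandas assigns default integer column labels (0, 1)
from_array = pd.DataFrame(array)

# Column names and row labels can also be given explicitly
named = pd.DataFrame(array, columns=['x', 'y'], index=['r1', 'r2', 'r3'])

# A list of dictionaries: each dictionary is one row, and keys that are
# missing from a row are filled with NaN
records = [{'Name': 'John', 'Age': 25}, {'Name': 'Emily'}]
from_records = pd.DataFrame(records)

print(from_array)
print(named)
print(from_records)
```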

## Reading data from various file formats (CSV, Excel, SQL databases, etc.)

Pandas provides convenient functions to read data from various file formats, including CSV, Excel, SQL databases, JSON, and more.

In [1]:
import pandas as pd

### Reading data from a CSV file

In [2]:
# Assuming you have a CSV file named 'data.csv' in the same directory
csv_data = pd.read_csv('data.csv')
print(csv_data)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


The `pd.read_csv()` function reads the data from a CSV file and creates a DataFrame. You can also specify additional parameters to handle different CSV formats, encoding, and other options.
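To make those additional parameters concrete, here is a small sketch that parses semicolon-separated text from an in-memory buffer instead of a file on disk, so it runs without any external data. The sample content and the `'unknown'` placeholder are assumptions made for this example.

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for a semicolon-separated file
raw_csv = StringIO(
    "Name;Age;City\n"
    "John;25;New York\n"
    "Emily;unknown;London\n"
)

csv_data = pd.read_csv(
    raw_csv,
    sep=';',                  # field delimiter used by this file
    na_values=['unknown'],    # treat this placeholder as a missing value
    usecols=['Name', 'Age'],  # load only the columns that are needed
)
print(csv_data)
```

When reading a real file, the same keyword arguments apply, and `encoding` can be added if the file is not UTF-8.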

### Reading data from an Excel file

In [3]:
# Assuming you have an Excel file named 'data.xlsx' in the same directory
excel_data = pd.read_excel('data.xlsx')
print(excel_data)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


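The `pd.read_excel()` function reads data from an Excel file and creates a DataFrame. Reading `.xlsx` files requires an Excel engine such as `openpyxl` to be installed. You can choose the sheet with `sheet_name`, set the header row with `header`, and pass other options as needed.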

### Reading data from a SQL database

In [4]:
# Assuming a SQLite file 'database.db' that contains a 'users' table
import sqlite3

conn = sqlite3.connect('database.db')
query = "SELECT * FROM users"
sql_data = pd.read_sql_query(query, conn)
conn.close()  # Release the connection once the data is loaded
print(sql_data)

      Name  Age      City
0     John   25  New York
1    Emily   32    London
2  Michael   38     Paris
3    Sarah   27    Berlin


The `pd.read_sql_query()` function executes a SQL query against a database connection and creates a DataFrame from the result. You'll need to establish a connection to your database and provide a valid SQL query.
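For a version that runs without an existing database file, the sketch below creates an in-memory SQLite database, fills a `users` table with sample rows, and passes a query parameter through `params` rather than formatting it into the SQL string. The table contents are assumptions chosen to mirror the earlier examples.

```python
import sqlite3

import pandas as pd

# In-memory database, so no file is created on disk
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (Name TEXT, Age INTEGER, City TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [('John', 25, 'New York'), ('Emily', 32, 'London'),
     ('Michael', 38, 'Paris'), ('Sarah', 27, 'Berlin')],
)
conn.commit()

# The `?` placeholder is filled from `params` by the database driver
query = "SELECT Name, Age FROM users WHERE Age >= ? ORDER BY Age"
adults = pd.read_sql_query(query, conn, params=(30,))
conn.close()

print(adults)
```

Passing values through `params` keeps user-supplied input out of the SQL text, which avoids quoting mistakes and SQL injection.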

Pandas also supports reading data from other formats like JSON, HTML, and more. The flexibility in reading data from various sources makes it convenient to work with data from different environments and integrate it into your analysis or processing pipelines.
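As one example of these other formats, the following sketch reads a JSON array of records with `pd.read_json()`. The JSON text is wrapped in `StringIO` so the example is self-contained; the sample records are made up for illustration.

```python
from io import StringIO

import pandas as pd

# A JSON array in which each object is one row
raw_json = '[{"Name": "John", "Age": 25}, {"Name": "Emily", "Age": 32}]'

json_data = pd.read_json(StringIO(raw_json))
print(json_data)
```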

# Data Inspection and Selection

After creating pandas objects (Series and DataFrames), it's essential to inspect and explore the data before performing any analysis or manipulation. Pandas provides various methods and techniques for data inspection and selection, which we'll cover in this section.

In [1]:
import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Michael', 'Sarah', 'David', 'Jessica'],
        'Age': [25, 32, 38, 27, 45, None],
        'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo', 'Sydney'],
        'Income': [50000, 65000, 75000, 42000, 88000, 56000]}

df = pd.DataFrame(data)

## Viewing the first and last few rows

In [2]:
# Viewing the first 3 rows
print(df.head(3))
print()  # Blank line for better readability

# Viewing the last 3 rows
print(df.tail(3))

      Name   Age      City  Income
0     John  25.0  New York   50000
1    Emily  32.0    London   65000
2  Michael  38.0     Paris   75000

      Name   Age    City  Income
3    Sarah  27.0  Berlin   42000
4    David  45.0   Tokyo   88000
5  Jessica   NaN  Sydney   56000


The `head()` and `tail()` methods allow you to quickly inspect the first and last few rows of a DataFrame, respectively. By default, they show 5 rows, but you can specify the number of rows to display by passing an argument.

## Checking the data types

In [3]:
# Checking the data types
print(df.dtypes)

Name       object
Age       float64
City       object
Income      int64
dtype: object


The `dtypes` attribute of a DataFrame lists the data type of each column. This helps you understand your data and spot potential issues. Here, `Age` is `float64` rather than `int64` because the missing value (`None`) is stored as `NaN`, which is a floating-point value, while the `object` columns hold Python strings.
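Data types alone do not show how many values are missing. A minimal sketch of two complementary checks follows: counting missing values per column with `isna().sum()`, and converting a float column to the nullable `Int64` dtype so that integers and missing values can coexist. The shortened sample data is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily', 'Jessica'],
                   'Age': [25, 32, None],
                   'Income': [50000, 65000, 56000]})

# Number of rows and columns
rows, cols = df.shape

# Count of missing values in each column
missing_per_column = df.isna().sum()

# Nullable integer dtype: whole numbers stay integers, NaN becomes <NA>
age_nullable = df['Age'].astype('Int64')

print(rows, cols)
print(missing_per_column)
print(age_nullable)
```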

## Selecting columns and rows by label or index

In [4]:
# Selecting a column (Series)
print(df['Name'])
print()  # Blank line for better readability

# Selecting multiple columns
print(df[['Name', 'Age', 'City']].head(3))
print()  # Blank line for better readability

# Selecting a different subset of columns (all rows are returned)
print(df[['Age', 'Income']])

0       John
1      Emily
2    Michael
3      Sarah
4      David
5    Jessica
Name: Name, dtype: object

      Name   Age      City
0     John  25.0  New York
1    Emily  32.0    London
2  Michael  38.0     Paris

    Age  Income
0  25.0   50000
1  32.0   65000
2  38.0   75000
3  27.0   42000
4  45.0   88000
5   NaN   56000


You can select columns from a DataFrame by label (column name): `df['Name']` returns a single column as a Series, and `df[['Name', 'Age', 'City']]` returns several columns as a DataFrame. To select columns by integer position, use `df.iloc`, for example `df.iloc[:, 0]` for the first column.

Similarly, you can select rows using integer positions or boolean indexing, which we'll cover in the next section.
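Row selection by position and by label can be sketched with `.iloc` and `.loc`. The example uses a shortened copy of the sample data, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily', 'Michael', 'Sarah'],
                   'Age': [25, 32, 38, 27],
                   'City': ['New York', 'London', 'Paris', 'Berlin']})

# Position-based: the first row, returned as a Series
first_row = df.iloc[0]

# Position-based slice: rows at positions 0 and 1 (the end is exclusive)
first_two = df.iloc[0:2]

# Label-based slice: rows labeled 1 through 2 (the end is inclusive),
# restricted to two columns
rows_1_to_2 = df.loc[1:2, ['Name', 'City']]

# A single cell, addressed by row label and column name
one_cell = df.loc[2, 'Name']

print(first_row)
print(first_two)
print(rows_1_to_2)
print(one_cell)
```

The difference in slice endpoints is a common source of off-by-one mistakes: `.iloc` follows Python slicing rules, while `.loc` includes both endpoints.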

## Conditional selection

In [5]:
# Selecting rows based on a condition
print(df[df['Age'] >= 35])
print()  # Blank line for better readability

# Selecting rows with missing values
print(df[df['Age'].isna()])

      Name   Age   City  Income
2  Michael  38.0  Paris   75000
4    David  45.0  Tokyo   88000

      Name  Age    City  Income
5  Jessica  NaN  Sydney   56000
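Conditions like the ones above produce boolean Series, which can be combined with `&` (and), `|` (or), and `~` (not). The sketch below rebuilds the sample DataFrame so it runs on its own; the thresholds and variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily', 'Michael', 'Sarah', 'David', 'Jessica'],
                   'Age': [25, 32, 38, 27, 45, None],
                   'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo', 'Sydney'],
                   'Income': [50000, 65000, 75000, 42000, 88000, 56000]})

# Each condition needs its own parentheses because & binds tighter than >=
older_high_income = df[(df['Age'] >= 30) & (df['Income'] > 60000)]

# Membership test against a list of allowed values
in_selected_cities = df[df['City'].isin(['London', 'Tokyo'])]

# Negation: keep only rows where Age is present
known_age = df[~df['Age'].isna()]

print(older_high_income)
print(in_selected_cities)
print(known_age)
```

Comparisons against `NaN` evaluate to `False`, so the row with the missing age is excluded from `older_high_income` without any extra handling.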
