# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**

4. **Creating DataFrames**
    
    - From lists, dictionaries, and arrays
    - Reading data from CSV, Excel, and other formats
5. **Basic DataFrame Operations**
    
    - Inspecting the DataFrame
    - Indexing and selecting data
    - Descriptive statistics
6. **Data Cleaning and Handling Missing Data**
    
    - Handling missing values
    - Dropping or filling missing values
    - Removing duplicates

### **`5. Basic DataFrame Operations`**

#### **`Inspecting the DataFrame`**

**Introduction:**
Inspecting a DataFrame is an essential step in understanding its structure and contents. Pandas provides several methods and attributes that let you gain insights into the data quickly. In this section, we'll explore the most common ones: `head()`, `tail()`, `info()`, `shape`, and `describe()`.

**Using `head()` and `tail()`:**

1. **`head(n)`:**
   - The `head()` method returns the first `n` rows of the DataFrame (5 if `n` is omitted). It is useful for quickly getting an overview of the dataset.

In [2]:
import pandas as pd

df = pd.read_csv('data.csv')
df_head = df.head(5)  # Display the first 5 rows

print(df_head)

     Name  Age          City
0  Laxman   25          Pune
1  Rajesh   30     Hyderabad
2     Ram   22  Mahabubnagar
3   Ganga   32  Mahabubnagar
4  Jamuna   32          Pune


2. **`tail(n)`:**
   - The `tail()` method returns the last `n` rows of the DataFrame (5 if `n` is omitted), allowing you to inspect the end of the dataset.

In [3]:
df_tail = df.tail(5)  # Display the last 5 rows

print(df_tail)

      Name  Age          City
4   Jamuna   32          Pune
5  Namrata   15          Pune
6   Varsha   16     Hyderabad
7   Vamshi   22     Hyderabad
8   Ananya   14  Mahabubnagar
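The sketch below illustrates two details of `head()` and `tail()` that the calls above do not show: the default of 5 rows and the behaviour of a negative `n`. Because `data.csv` itself is not available here, the DataFrame is rebuilt by hand from the rows printed above; that reconstruction is an assumption made only so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for data.csv, rebuilt from the rows shown in the outputs above
df = pd.DataFrame({
    "Name": ["Laxman", "Rajesh", "Ram", "Ganga", "Jamuna",
             "Namrata", "Varsha", "Vamshi", "Ananya"],
    "Age": [25, 30, 22, 32, 32, 15, 16, 22, 14],
    "City": ["Pune", "Hyderabad", "Mahabubnagar", "Mahabubnagar", "Pune",
             "Pune", "Hyderabad", "Hyderabad", "Mahabubnagar"],
})

default_head = df.head()          # n defaults to 5
all_but_last_two = df.head(-2)    # negative n: every row except the last 2
all_but_first_two = df.tail(-2)   # negative n: every row except the first 2

print(len(default_head))                   # 5
print(len(all_but_last_two))               # 7
print(all_but_first_two.index.tolist())    # [2, 3, 4, 5, 6, 7, 8]
```

Both methods return a new DataFrame, so the result can be stored, printed, or chained with other calls.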


**Using `info()`:**

1. **`info()`:**
   - The `info()` method provides a concise summary of the DataFrame, including the data types, non-null counts, and memory usage.

In [4]:
# info() prints the summary itself and returns None,
# so call it directly instead of assigning and printing the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    9 non-null      object
 1   Age     9 non-null      int64 
 2   City    9 non-null      object
dtypes: int64(1), object(2)
memory usage: 344.0+ bytes
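Since `info()` writes its summary to the screen and returns `None`, a different approach is needed when the summary should be kept as text, for example for a log file. The following is a minimal sketch that uses the `buf` and `memory_usage` parameters of `info()`; the three-row DataFrame is a made-up stand-in, not the tutorial's dataset.

```python
import io

import pandas as pd

# Small illustrative DataFrame (hypothetical sample data)
df = pd.DataFrame({"Name": ["Laxman", "Rajesh", "Ram"], "Age": [25, 30, 22]})

result = df.info()       # prints the summary as a side effect
print(result is None)    # True: there is nothing useful to store

# Redirect the summary into an in-memory text buffer instead of the screen.
# memory_usage="deep" asks pandas to measure object columns precisely,
# rather than reporting the lower bound marked with "+".
buffer = io.StringIO()
df.info(buf=buffer, memory_usage="deep")
summary_text = buffer.getvalue()

print(summary_text)
```

The captured `summary_text` is an ordinary string, so it can be written to a file or searched like any other text.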


**Using `shape`:**

1. **`shape`:**
   - The `shape` attribute returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).

In [5]:
df_shape = df.shape

print(df_shape)

(9, 3)
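Because `shape` is a plain tuple, it can be unpacked into separate row and column counts. The short sketch below shows this along with two related values, `len(df)` and `df.size`; the sample data is hypothetical and chosen only to keep the snippet self-contained.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical sample data)
df = pd.DataFrame({"Name": ["Laxman", "Rajesh", "Ram"], "Age": [25, 30, 22]})

n_rows, n_cols = df.shape    # unpack the (rows, columns) tuple

print(df.shape)              # (3, 2)
print(len(df))               # 3, the number of rows
print(df.size)               # 6, rows * columns
print(df["Age"].shape)       # (3,), a single column is one-dimensional
```

Note that `shape` is an attribute, not a method, so it is written without parentheses.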


**Using `describe()`:**

1. **`describe()`:**
   - The `describe()` method generates descriptive statistics, including measures of central tendency, dispersion, and shape of the distribution.

In [6]:
df_describe = df.describe()

print(df_describe)

# Note: by default only numeric columns appear in the output,
# so Name and City are not displayed

             Age
count   9.000000
mean   23.111111
std     7.166667
min    14.000000
25%    16.000000
50%    22.000000
75%    30.000000
max    32.000000


2. **Customizing `describe()`:**
   - You can customize the output of `describe()` to include specific percentiles or types of statistics.

In [8]:
custom_describe = df.describe(percentiles=[0.386, 0.5, 0.618, 0.786], include='all')

print(custom_describe)

          Name        Age  City
count        9   9.000000     9
unique       9        NaN     3
top     Laxman        NaN  Pune
freq         1        NaN     3
mean       NaN  23.111111   NaN
std        NaN   7.166667   NaN
min        NaN  14.000000   NaN
38.6%      NaN  22.000000   NaN
50%        NaN  22.000000   NaN
61.8%      NaN  24.832000   NaN
78.6%      NaN  30.576000   NaN
max        NaN  32.000000   NaN
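To see where the custom percentile rows come from, the same values can be computed directly with `Series.quantile()`, which by default interpolates linearly between neighbouring data points. This is a small cross-check, assuming the `Age` column holds the nine values shown in the `head()` and `tail()` outputs earlier.

```python
import pandas as pd

# Age values taken from the rows displayed earlier in this section
ages = pd.Series([25, 30, 22, 32, 32, 15, 16, 22, 14], name="Age")

percentiles = [0.386, 0.5, 0.618, 0.786]

# quantile() returns one value per requested percentile
quantiles = ages.quantile(percentiles)
print(quantiles)

# describe() reports the same numbers under percentage labels
described = ages.describe(percentiles=percentiles)
print(described)
```

For example, the 61.8% row evaluates to 24.832: it falls between the sorted values 22 and 25, and linear interpolation places it most of the way towards 25.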


#### Explanation:

The `include` parameter of `describe()` controls which columns are summarized, based on their data types. By default (`include=None`), only numeric columns are summarized. Passing a value lets you choose numeric columns only, object (string) columns only, or all columns regardless of type. Commonly used values are:

- `'all'`: This includes all columns, regardless of their data types. Both numeric and non-numeric columns will be summarized.

- `'number'`: This includes only numeric columns in the summary. Non-numeric columns, such as strings or categorical data, will be excluded from the output.

- `'object'`: This includes only object (string) columns in the summary. Numeric columns will be excluded.

Here's an example to illustrate the usage of the `include` parameter:


In [9]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'numeric_col': [1, 2, 3, 4, 5],
    'string_col': ['apple', 'banana', 'orange', 'grape', 'kiwi']
}

df = pd.DataFrame(data)

# Using describe with different include values
all_columns_describe = df.describe(include='all')
numeric_columns_describe = df.describe(include='number')
object_columns_describe = df.describe(include='object')

print("Describe All Columns:")
print(all_columns_describe)

print("\nDescribe Numeric Columns Only:")
print(numeric_columns_describe)

print("\nDescribe Object (String) Columns Only:")
print(object_columns_describe)


Describe All Columns:
        numeric_col string_col
count      5.000000          5
unique          NaN          5
top             NaN      apple
freq            NaN          1
mean       3.000000        NaN
std        1.581139        NaN
min        1.000000        NaN
25%        2.000000        NaN
50%        3.000000        NaN
75%        4.000000        NaN
max        5.000000        NaN

Describe Numeric Columns Only:
       numeric_col
count     5.000000
mean      3.000000
std       1.581139
min       1.000000
25%       2.000000
50%       3.000000
75%       4.000000
max       5.000000

Describe Object (String) Columns Only:
       string_col
count           5
unique          5
top         apple
freq            1


**Conclusion:**
Inspecting a DataFrame is a crucial step in the data analysis process. Methods such as `head()`, `tail()`, `info()`, `shape`, and `describe()` provide valuable information about the structure, contents, and statistical summary of the dataset. Using these methods allows you to quickly assess the data and make informed decisions about further analysis.

#### Examples

In [12]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Laxman', 'Laxmikanth', 'Ashwanth', 'Ashok', 'Venky'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)

# Displaying the DataFrame
print("Original DataFrame:")
print(df)

# Using head() and tail() for an overview
print("\nFirst 3 Rows (head()):")
print(df.head(3))

print("\nLast 2 Rows (tail()):")
print(df.tail(2))

# Using info() for a summary
# (info() prints the summary itself and returns None, so call it directly)
print("\nDataFrame Info:")
df.info()

# Using shape to get dimensions
df_shape = df.shape
print("\nDataFrame Shape:", df_shape)

# Using describe() for summary statistics
df_describe = df.describe()
print("\nSummary Statistics:")
print(df_describe)

Original DataFrame:
         Name  Age  Salary  Experience
0      Laxman   25   50000           3
1  Laxmikanth   30   60000           5
2    Ashwanth   35   75000           8
3       Ashok   22   48000           2
4       Venky   28   55000           4

First 3 Rows (head()):
         Name  Age  Salary  Experience
0      Laxman   25   50000           3
1  Laxmikanth   30   60000           5
2    Ashwanth   35   75000           8

Last 2 Rows (tail()):
    Name  Age  Salary  Experience
3  Ashok   22   48000           2
4  Venky   28   55000           4

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Salary      5 non-null      int64 
 3   Experience  5 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 288.0+ bytes

DataFrame Shape: (5, 4)

#### Real-world Scenario:
Consider a scenario where you have a dataset containing information about sales transactions for an e-commerce platform. You want to inspect the data to understand its structure, check for missing values, and get a quick overview of the sales performance.

In [13]:
import pandas as pd

# Sample e-commerce sales data
sales_data = {
    'OrderID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera'],
    'Quantity': [2, 1, 3, 2, 1],
    'Price': [1200, 800, 300, 150, 700],
    'CustomerID': [101, 102, 103, 104, 105],
    'Date': ['2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03', '2022-01-03'],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Inspecting the DataFrame
print("Overview of Sales Data:")
print(df_sales.head())
print("\nStructure of Sales Data:")
df_sales.info()  # info() prints the summary itself and returns None
print("\nSummary Statistics of Sales Data:")
print(df_sales.describe())

Overview of Sales Data:
   OrderID     Product  Quantity  Price  CustomerID        Date
0      101      Laptop         2   1200         101  2022-01-01
1      102  Smartphone         1    800         102  2022-01-02
2      103      Tablet         3    300         103  2022-01-02
3      104  Headphones         2    150         104  2022-01-03
4      105      Camera         1    700         105  2022-01-03

Structure of Sales Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   OrderID     5 non-null      int64 
 1   Product     5 non-null      object
 2   Quantity    5 non-null      int64 
 3   Price       5 non-null      int64 
 4   CustomerID  5 non-null      int64 
 5   Date        5 non-null      object
dtypes: int64(4), object(2)
memory usage: 368.0+ bytes

Summary Statistics of Sales Data:
          OrderID  Quantity        Price  CustomerI

#### Considerations or Peculiarities:

- **Data Types:** Ensure that data types are appropriate for each column. Dates should be in datetime format, and numerical columns should have the correct data type.

- **Missing Values:** Check for missing values using methods like `isnull()` or `info()`. Decide on a strategy to handle missing data if needed.

- **Categorical Columns:** Identify and encode categorical columns appropriately. Some columns may have a finite set of categories, and using the `astype('category')` method can save memory.
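The three considerations above can be sketched on a trimmed-down version of the sales data. The snippet uses `pd.to_datetime()`, `isnull().sum()`, and `astype('category')`; which columns to convert is a judgment call made here for illustration only.

```python
import pandas as pd

# Subset of the sales data from the scenario above
df_sales = pd.DataFrame({
    "Product": ["Laptop", "Smartphone", "Tablet", "Headphones", "Camera"],
    "Price": [1200, 800, 300, 150, 700],
    "Date": ["2022-01-01", "2022-01-02", "2022-01-02", "2022-01-03", "2022-01-03"],
})

# Data types: parse the date strings into proper datetime values
df_sales["Date"] = pd.to_datetime(df_sales["Date"])

# Missing values: count the nulls in each column
missing_counts = df_sales.isnull().sum()

# Categorical columns: convert a column that holds a finite set of labels
df_sales["Product"] = df_sales["Product"].astype("category")

print(df_sales.dtypes)
print(missing_counts)
```

With only five distinct products in five rows, the categorical conversion will not reduce memory use here; the saving appears when the same labels repeat across many rows.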

#### Common Mistakes:

- **Neglecting Missing Values:** Ignoring missing values during inspection can lead to incorrect analyses. Always check for missing data and decide how to handle it.

- **Not Understanding Data Types:** Misinterpreting data types may lead to errors in analysis. Make sure to understand the meaning and representation of each column's data type.

- **Overlooking Categorical Variables:** Categorical variables may not always be automatically identified. Check and convert categorical columns if needed, especially if they are nominal or ordinal.

Inspecting the DataFrame is a crucial step to understand the data's characteristics and make informed decisions during data analysis. Adapt the example code and considerations based on the specifics of your real-world datasets.