
## **1. Datasets in Python**

### **What are Datasets?**
A **dataset** is a structured collection of data typically organized into rows and columns. It represents information like customer details, sales transactions, survey responses, or scientific observations.

### **Key Properties of Datasets:**
- **Rows:** Represent individual records or observations.
- **Columns:** Represent features or attributes of the data.
- **Index:** Row labels that identify each record. Pandas uses a default integer range unless you set one, and labels are usually, but not necessarily, unique.

### **Common Dataset Formats in Python**
Pandas can read and write datasets in several file formats:
1. **CSV (Comma-Separated Values):**
   - Stores data in plain text format with columns separated by commas.
   - Advantages: Easy to read and supported by almost all software.
   - Used in Pandas with `pd.read_csv()`.

2. **Excel Files:**
   - Stores data in a grid format, supports formulas, charts, and formatting.
   - Pandas reads them with `pd.read_excel()`, which needs an Excel engine such as `openpyxl` installed for `.xlsx` files.

3. **SQL Databases:**
   - Stores data in relational tables.
   - Efficient for large datasets and complex queries.
   - Pandas connects using `pd.read_sql_query()`.

4. **JSON (JavaScript Object Notation):**
   - Lightweight data format for exchanging data between systems.
   - Structured as key-value pairs (objects) and ordered lists (arrays), which can be nested.
   - Loaded with `pd.read_json()`.

5. **Other Formats:**
   - HTML tables (`pd.read_html()`).
   - HDF5 files (`pd.read_hdf()`).
   - Pickle files (`pd.read_pickle()`).

---

## **2. Basics of Pandas**

### **What is Pandas?**
Pandas is a Python library for data manipulation and analysis. It is built on top of NumPy and adds labeled data structures along with tools for loading, cleaning, filtering, and summarizing tabular data.

### **Key Components of Pandas**
1. **Series:**
   - A one-dimensional labeled array.
   - Think of it as a single column of data.
   - Each element is associated with an index.

2. **DataFrame:**
   - A two-dimensional table-like structure with labeled rows and columns.
   - Can hold data of different types (e.g., integers, floats, strings).

---

### **Installing Pandas**
To install Pandas, use the following command:
```bash
pip install pandas
```

### **Importing Pandas**
Always import Pandas before using it:
```python
import pandas as pd
```

---

### **Basic Operations with Pandas**
Here are some commonly used Pandas operations to get started:

- **Importing Libraries:**
   ```python
   import pandas as pd
   import numpy as np
   ```

- **Creating Data Structures:**
   - A Series:
     ```python
     data = pd.Series([10, 20, 30, 40])
     print(data)
     ```
   - A DataFrame:
     ```python
     data = {
         'Name': ['Alice', 'Bob', 'Charlie'],
         'Age': [25, 30, 35],
         'City': ['New York', 'Los Angeles', 'Chicago']
     }
     df = pd.DataFrame(data)
     print(df)
     ```

---

## **3. DataFrames**

### **What is a DataFrame?**
A DataFrame is a two-dimensional data structure in Pandas, similar to an Excel spreadsheet or a SQL table.

### **Key Features of DataFrames**
1. **Labeled Rows and Columns:**
   - Each row and column has a label (index for rows, names for columns).
2. **Heterogeneous Data:**
   - Columns can store data of different types.
3. **Size Mutability:**
   - DataFrames can expand or shrink dynamically.
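
The three features above can be seen in a short sketch. The data, the row labels `r1` and `r2`, and the `Passed` column are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [25, 30], "Score": [88.5, 92.0]},
    index=["r1", "r2"],  # custom row labels
)

# Labeled rows and columns
print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column names

# Heterogeneous data: each column has its own dtype
print(df.dtypes)

# Size mutability: add a column, then drop a row
df["Passed"] = df["Score"] >= 90
df = df.drop(index="r1")
print(df.shape)
```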

---

### **Creating DataFrames**
You can create a DataFrame in multiple ways:

#### **3.1 From a Dictionary**
A dictionary can map column names to lists or arrays of values:
```python
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
```

**Output:**
```
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
```

---

#### **3.2 From a List of Dictionaries**
A list of dictionaries is useful when each dictionary represents a row of data:
```python
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)
```

**Output:**
```
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
```

---

#### **3.3 From a NumPy Array**
If your data is already in a two-dimensional NumPy array, you can convert it into a DataFrame:
```python
import numpy as np

# Create a NumPy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a DataFrame
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
```

**Output:**
```
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
```

---

#### **3.4 From CSV Files**
A CSV (Comma-Separated Values) file is a plain text file in which each line is one row and the values within a row are separated by commas.

##### **Reading a CSV File**
To load a CSV file into a Pandas DataFrame:
```python
df = pd.read_csv('data.csv')
print(df.head())  # View the first 5 rows
```

##### **Writing to a CSV File**
You can export a DataFrame to a CSV file:
```python
df.to_csv('output.csv', index=False)
```

---

#### **3.5 From SQL Databases**
Pandas can load the result of a SQL query into a DataFrame with the `pd.read_sql_query()` function.

##### **Example with SQLite**
```python
import sqlite3

# Connect to a database (or create one if it doesn't exist)
conn = sqlite3.connect('example.db')

# SQL query to run (assumes an `employees` table already exists in the database)
query = "SELECT * FROM employees"

# Load the query result into a DataFrame
df = pd.read_sql_query(query, conn)
print(df.head())
```

##### **Exporting to SQL**
DataFrames can also be exported to a SQL database:
```python
df.to_sql('employees', conn, if_exists='replace', index=False)
```
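
The two steps above can be combined into one self-contained sketch. It uses an in-memory SQLite database so nothing is written to disk; the table contents and the salary threshold are invented for illustration.

```python
import sqlite3

import pandas as pd

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical sample data.
employees = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "salary": [50000, 60000, 70000]}
)

# Create the table first so the SELECT has something to read.
employees.to_sql("employees", conn, if_exists="replace", index=False)

# Read back a filtered subset; `params` fills the ? placeholder safely.
high_paid = pd.read_sql_query(
    "SELECT name, salary FROM employees WHERE salary > ?",
    conn,
    params=(55000,),
)
print(high_paid)

conn.close()
```

Passing values through `params` rather than formatting them into the query string keeps the query safe from SQL injection.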

---

#### **3.6 Creating an Empty DataFrame**
An empty DataFrame can be initialized and populated later:
```python
# Create an empty DataFrame
df = pd.DataFrame(columns=['A', 'B', 'C'])
print(df)

# Add rows to the DataFrame
df.loc[0] = [1, 2, 3]
df.loc[1] = [4, 5, 6]
print(df)
```

**Output:**
```
Empty DataFrame
Columns: [A, B, C]
Index: []

   A  B  C
0  1  2  3
1  4  5  6
```
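
Adding rows one at a time with `loc` is fine for a handful of rows, but it tends to get slow as the table grows. A common alternative, sketched here with made-up values, is to collect the rows in a plain list and build the DataFrame once at the end.

```python
import pandas as pd

# Collect rows in a plain Python list first (hypothetical values).
rows = []
for i in range(3):
    rows.append({"A": i, "B": i * 2, "C": i * 3})

# Build the DataFrame once, at the end.
df = pd.DataFrame(rows, columns=["A", "B", "C"])
print(df)
print(df.dtypes)  # integer columns, since the data was known at construction
```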

---


## **4. Exploring DataFrames**

After creating or loading a DataFrame, the next step is to explore and understand its structure and content. Pandas provides several functions to make this process easier.

### **4.1. Inspecting the Data**
#### **`head()` and `tail()`**
- `head(n)` displays the first `n` rows (default is 5).
- `tail(n)` displays the last `n` rows (default is 5).
```python
print(df.head())  # First 5 rows
print(df.tail(3))  # Last 3 rows
```

#### **Output Example:**
```
# Sample DataFrame (df)
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

# print(df.head(2))
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles

# print(df.tail(1))
      Name  Age     City
2  Charlie   35  Chicago
```

---

### **4.2. Summary Information**
#### **`info()`**
Provides a summary of the DataFrame, including:
- Data types of each column.
- Non-null counts.
- Memory usage.
```python
df.info()  # info() prints the summary itself and returns None
```

**Output:**
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
```

---

### **4.3. Shape**
The `shape` attribute provides the number of rows and columns as a tuple:
```python
print(df.shape)
```
**Output:**
```
(3, 3)  # 3 rows, 3 columns
```

---

### **4.4. Descriptive Statistics**
#### **`describe()`**
Displays summary statistics for numerical columns:
```python
print(df.describe())
```

**Output:**
```
             Age
count   3.000000
mean   30.000000
std     5.000000
min    25.000000
25%    27.500000
50%    30.000000
75%    32.500000
max    35.000000
```

For non-numeric data:
```python
print(df.describe(include='object'))
```

---

### **4.5. Viewing Data**
#### **Values**
The `values` attribute returns the data in the DataFrame as a NumPy array (`df.to_numpy()` is the currently recommended equivalent):
```python
print(df.values)
```
**Output:**
```
[['Alice' 25 'New York']
 ['Bob' 30 'Los Angeles']
 ['Charlie' 35 'Chicago']]
```

#### **Columns**
The `columns` attribute lists column names:
```python
print(df.columns)
```
**Output:**
```
Index(['Name', 'Age', 'City'], dtype='object')
```

#### **Index**
The `index` attribute provides the row indices:
```python
print(df.index)
```
**Output:**
```
RangeIndex(start=0, stop=3, step=1)
```

---

## **5. Accessing Data**

Pandas provides multiple methods to access and manipulate data within a DataFrame.

---

### **5.1. Column Access**
Columns can be accessed using:
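1. **Dot Notation** (works only when the column name is a valid Python identifier and does not clash with a DataFrame attribute or method):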
   ```python
   print(df.Name)
   ```
2. **Bracket Notation**:
   ```python
   print(df['Name'])
   ```
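
Bracket notation is the safer default. The sketch below uses two hypothetical column names, `count` and `First Name`, to show cases where dot notation does not return the column.

```python
import pandas as pd

# Hypothetical column names chosen to show the two failure cases.
df = pd.DataFrame({"count": [3, 7], "First Name": ["Alice", "Bob"]})

# Bracket notation always returns the column.
print(df["count"].tolist())
print(df["First Name"].tolist())

# Dot notation resolves to the DataFrame.count method, not the column.
print(callable(df.count))

# df.First Name is not valid Python syntax, so only brackets work here.
```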

#### **Adding a New Column**
You can create a new column:
```python
df['Country'] = ['USA', 'USA', 'USA']
print(df)
```

#### **Deleting a Column**
Use the `drop()` method:
```python
df = df.drop('Country', axis=1)  # axis=1 for columns
```

---

### **5.2. Pandas Series**
A Series is a one-dimensional array-like object. Each column of a DataFrame is a Series.

#### **Creating a Series**
```python
data = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(data)
```

#### **Accessing Series Elements**
Use labels or indices:
```python
print(data['a'])  # By label
print(data.iloc[0])  # By position (plain data[0] on a labeled Series is deprecated)
```
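
To keep label access and position access unambiguous, the explicit `loc` and `iloc` indexers can be used on a Series as well. A short sketch with the same data:

```python
import pandas as pd

data = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(data.loc["b"])      # by label
print(data.iloc[1])       # by position
print(data.loc["a":"b"])  # label slice includes the end label
print(data.iloc[0:2])     # position slice excludes the end position
```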

---

### **5.3. Accessing a Cell**
You can access individual elements using:
1. **`at`** (label-based access):
   ```python
   print(df.at[0, 'Name'])
   ```
2. **`iat`** (integer position-based access):
   ```python
   print(df.iat[0, 0])
   ```
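
Both accessors can also write a single cell. The following sketch, using illustrative values, reads a cell each way and then updates two cells in place.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

# Read one cell by label and by position.
print(df.at[1, "Name"])
print(df.iat[1, 1])

# Write one cell in place.
df.at[1, "Age"] = 31
df.iat[2, 0] = "Charles"
print(df)
```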

---

### **5.4. Row Access**
#### **Using Slicing**
Access rows using slicing:
```python
print(df[1:3])  # Rows at positions 1 and 2 (end position excluded)
```

#### **Accessing a Specific Row**
```python
print(df.loc[1])  # Row whose index label is 1
```

---

### **5.5. Label-Based Access (`loc`)**
The `loc` indexer accesses rows and columns by label. Label slices include both endpoints.

#### **Selecting Rows**
```python
print(df.loc[0])  # Row labelled 0 (the first row with the default index)
```

#### **Selecting Rows and Columns**
```python
print(df.loc[0, 'Name'])  # First row, 'Name' column
```

#### **Selecting Multiple Rows and Columns**
```python
print(df.loc[0:1, ['Name', 'Age']])  # Labels 0 through 1 (end label included)
```
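
`loc` also accepts a boolean mask in the row position, which is one way to select rows by condition. The sketch below rebuilds the sample DataFrame and keeps rows that satisfy two conditions; the thresholds are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# A comparison on a column yields a boolean Series, one value per row.
mask = (df["Age"] > 25) & (df["City"] != "Chicago")

# loc keeps the rows where the mask is True and the listed columns.
result = df.loc[mask, ["Name", "Age"]]
print(result)
```

Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.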

---

### **5.6. Integer Position-Based Access (`iloc`)**
The `iloc` indexer accesses rows and columns by integer position (0-based). Position slices exclude the end position.

#### **Selecting Rows**
```python
print(df.iloc[0])  # First row
```

#### **Selecting Rows and Columns**
```python
print(df.iloc[0, 1])  # First row, second column
```

#### **Selecting Multiple Rows and Columns**
```python
print(df.iloc[0:2, 0:2])  # First two rows, first two columns
```

---

### **5.7. Multi-Column Access**
To access multiple columns, use a list of column names:
```python
print(df[['Name', 'City']])
```

---

### **5.8. Row and Column Location**
You can combine row and column access methods:
```python
print(df.loc[0:1, 'Name':'City'])  # Rows 0-1, columns 'Name' to 'City'
print(df.iloc[0:2, 0:2])           # First 2 rows, first 2 columns
```
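
With the default index, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical index of `10, 20, 30` to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],  # labels differ from positions
)

by_label = df.loc[10:20]     # labels 10 through 20, end included
by_position = df.iloc[0:2]   # positions 0 and 1, end excluded
print(by_label.equals(by_position))

# df.loc[0] would raise KeyError: there is no row labelled 0.
print(df.iloc[0]["Name"])
```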

---


### **Questions**

1. **Basic Exploration**  
   Load the dataset and perform the following operations:
   - Display the first 10 rows of the dataset.
   - Show the column names, data types, and non-null counts using an appropriate method.

2. **Accessing Columns**  
   Extract the **`Username`** and **`Subscribers`** columns into a new DataFrame. Display the first 5 rows of the resulting DataFrame.

3. **Data Summary and Statistics**  
   - Identify the total number of rows and columns in the dataset.  
   - Display a summary of the dataset, including descriptive statistics for numeric columns.  
   - Check for missing values in the dataset.



4. **Filtering Rows Based on Conditions**  
   Create a new DataFrame containing only channels:
   - With **Subscribers greater than 100 million**.
   - Belonging to a country other than the US.  
   Show the top 5 rows of this filtered DataFrame.

5. **Using `loc` and `iloc`**  
   Perform the following operations:  
   - Select and display the rows where the **Country** is 'IN' (India), using label-based access (`loc`).  
   - Using integer-based access (`iloc`), retrieve and display the first 10 rows of only the **Ranking**, **Username**, and **Subscribers** columns.  

6. **Handling and Cleaning Data**  
   Perform the following operations to clean and explore the dataset:
   - Remove any rows with missing values in the **Country** column.  
   - Convert the **Subscribers** and **Views** columns to numeric data types by removing extra characters like commas and spaces.  
   - Display the total number of subscribers (sum of the column) after cleaning.  
