# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**
4. **Creating DataFrames**
   - From lists, dictionaries, and arrays
   - Reading data from CSV, Excel, and other formats

5. **Basic DataFrame Operations**
   - Inspecting the DataFrame
   - Indexing and selecting data
   - Descriptive statistics

6. **Data Cleaning and Handling Missing Data**
   - Handling missing values
   - Dropping or filling missing values
   - Removing duplicates

### **`4. Creating DataFrames: `**

#### `From Lists, Dictionaries, and Arrays`

**Introduction:**
Creating a Pandas DataFrame is a fundamental step in data analysis. In this section, we will explore three common methods for creating DataFrames: using lists, dictionaries, and arrays.

**From Lists:**

1. **Using Lists as Columns:**
   - You can create a DataFrame by using lists as columns. Each list represents a column, and the lengths of the lists must match.

In [1]:
     import pandas as pd

     names = ['Laxman', 'Rajesh', 'Ramulu']
     ages = [25, 30, 35]
     cities = ['Srinagar', 'Pune', 'Mahabubnagar']

     df_from_lists = pd.DataFrame({'Name': names, 'Age': ages, 'City': cities})

     print(df_from_lists)

     Name  Age          City
0  Laxman   25      Srinagar
1  Rajesh   30          Pune
2  Ramulu   35  Mahabubnagar


2. **Specifying Index:**
   - You can specify a custom index for the DataFrame:

In [2]:
     custom_index = ['person1', 'person2', 'person3']
     df_from_lists_custom_index = pd.DataFrame({'Name': names, 'Age': ages, 'City': cities}, index=custom_index)

     print(df_from_lists_custom_index)

           Name  Age          City
person1  Laxman   25      Srinagar
person2  Rajesh   30          Pune
person3  Ramulu   35  Mahabubnagar


**From Dictionaries:**

1. **Using Dictionary Keys as Columns:**
   - Creating a DataFrame from a dictionary allows you to use the keys as column names.

In [3]:
data_dict = {'Name': ['Harshita', 'Naina', 'Sai'],
            'Age': [25, 30, 35],
            'City': ['Pune', 'Pune', 'Hyderabad']}

df_from_dict = pd.DataFrame(data_dict)
print(df_from_dict)

       Name  Age       City
0  Harshita   25       Pune
1     Naina   30       Pune
2       Sai   35  Hyderabad


2. **Specifying Index:**
   - Similar to the list method, you can specify a custom index:

In [4]:
df_from_dict_custom_index = pd.DataFrame(data_dict, index=custom_index)

print(df_from_dict_custom_index)

             Name  Age       City
person1  Harshita   25       Pune
person2     Naina   30       Pune
person3       Sai   35  Hyderabad


**From Arrays:**

1. **Using NumPy Arrays:**
   - NumPy arrays can be used to create DataFrames. When each array becomes a column, all arrays must have the same length.

In [5]:
import numpy as np

names_array = np.array(['Alice', 'Bob', 'Charlie'])
ages_array = np.array([25, 30, 35])
cities_array = np.array(['New York', 'San Francisco', 'Los Angeles'])

df_from_arrays = pd.DataFrame({'Name': names_array, 'Age': ages_array, 'City': cities_array})

print(df_from_arrays)

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles


2. **Specifying Index:**
   - As before, you can specify a custom index:

In [6]:
df_from_arrays_custom_index = pd.DataFrame({'Name': names_array, 'Age': ages_array, 'City': cities_array}, index=custom_index)
print(df_from_arrays_custom_index)

            Name  Age           City
person1    Alice   25       New York
person2      Bob   30  San Francisco
person3  Charlie   35    Los Angeles


**Importance of Specifying Column Names and Indices:**

1. **Clarity and Readability:**
   - Specifying meaningful column names enhances the clarity and readability of your code and data.

2. **Consistency in Analysis:**
   - A consistent index allows for smoother and more predictable data analysis, especially when combining DataFrames or performing complex operations.

3. **Avoiding Ambiguity:**
   - Explicitly defining column names and indices avoids ambiguity and ensures that each piece of data is correctly associated with its intended category.
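
As a minimal sketch of these points, the snippet below builds the same rows twice: once with pandas' default labels and once with explicit `columns=` and `index=` arguments. The variable names `rows`, `df_default`, and `df_labeled` are illustrative and not part of the earlier cells.

```python
import pandas as pd

# Each inner list is one row; nothing here tells pandas what the columns mean.
rows = [
    ['Laxman', 25, 'Srinagar'],
    ['Rajesh', 30, 'Pune'],
    ['Ramulu', 35, 'Mahabubnagar'],
]

# Without explicit labels, columns and rows are numbered 0, 1, 2.
df_default = pd.DataFrame(rows)

# With explicit labels, every value is tied to a named column and a named row.
df_labeled = pd.DataFrame(
    rows,
    columns=['Name', 'Age', 'City'],
    index=['person1', 'person2', 'person3'],
)

print(list(df_default.columns))           # [0, 1, 2]
print(df_labeled.loc['person2', 'City'])  # Pune
```

With labels in place, `df_labeled.loc['person2', 'City']` reads more clearly than the positional equivalent `df_default.loc[1, 2]`.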

**Conclusion:**
Creating DataFrames in Pandas using lists, dictionaries, and arrays provides flexibility and versatility in handling different types of data. Specifying column names and indices during DataFrame creation is essential for clarity and consistency in subsequent data analysis tasks.

#### Example :

In [8]:
import pandas as pd
import numpy as np

# Creating a DataFrame from Lists
list_data = {
    'Name': ['Laxman', 'Rajesh', 'Bala', 'Ramulu', 'Padma'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df_from_lists = pd.DataFrame(list_data)

# Creating a DataFrame from a Dictionary
dict_data = {
    'ID': [1, 2, 3, 4, 5],
    'Subject': ['Math', 'Physics', 'Chemistry', 'Biology', 'English'],
    'Score': [85, 92, 78, 89, 95],
}

df_from_dict = pd.DataFrame(dict_data)

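# Creating a DataFrame from Arrays (using NumPy)
# Note: a NumPy array has a single dtype, so mixing integers and strings
# stores every value as a string; 'ID' and 'Quantity' hold text here.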
array_data = np.array([
    [1, 'Apple', 3],
    [2, 'Banana', 6],
    [3, 'Orange', 4],
])

df_from_arrays = pd.DataFrame(array_data, columns=['ID', 'Fruit', 'Quantity'])

# Displaying the DataFrames
print("DataFrame from Lists:")
print(df_from_lists)

print("\nDataFrame from Dictionary:")
print(df_from_dict)

print("\nDataFrame from Arrays:")
print(df_from_arrays)


DataFrame from Lists:
     Name  Age  Salary  Experience
0  Laxman   25   50000           3
1  Rajesh   30   60000           5
2    Bala   35   75000           8
3  Ramulu   22   48000           2
4   Padma   28   55000           4

DataFrame from Dictionary:
   ID    Subject  Score
0   1       Math     85
1   2    Physics     92
2   3  Chemistry     78
3   4    Biology     89
4   5    English     95

DataFrame from Arrays:
  ID   Fruit Quantity
0  1   Apple        3
1  2  Banana        6
2  3  Orange        4
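
One detail of the array example is worth checking: a NumPy array stores a single data type, so an array that mixes integers and strings keeps everything as strings. The sketch below, which reuses the same fruit data, shows one way to confirm this and convert the numeric columns afterwards. The `astype` step is a suggested fix, not part of the original example.

```python
import numpy as np
import pandas as pd

array_data = np.array([
    [1, 'Apple', 3],
    [2, 'Banana', 6],
    [3, 'Orange', 4],
])

# dtype.kind 'U' means the array holds Unicode strings, including the numbers.
print(array_data.dtype.kind)

df_raw = pd.DataFrame(array_data, columns=['ID', 'Fruit', 'Quantity'])
print(type(df_raw.loc[0, 'Quantity']))  # a string type, not an integer

# Convert the columns that should be numeric.
df_fixed = df_raw.astype({'ID': 'int64', 'Quantity': 'int64'})
print(df_fixed['Quantity'].sum())  # 13
```

If the columns have different types from the start, building the DataFrame from a dictionary of separate lists or arrays avoids the conversion step.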


#### Real World Scenario:
Imagine you have survey data from a group of people regarding their preferences for various types of electronic devices. Each person's data includes their ID, name, age, and the ratings (out of 10) they gave to different devices like smartphones, laptops, and smartwatches.

In [9]:
import pandas as pd

# Sample survey data
survey_data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Laxman', 'Rajesh', 'Padma', 'Harshita', 'Naina'],
    'Age': [25, 30, 35, 22, 28],
    'Smartphone_Rating': [9, 8, 7, 9, 8],
    'Laptop_Rating': [8, 7, 9, 6, 8],
    'Smartwatch_Rating': [7, 6, 8, 7, 9],
}

# Creating a DataFrame from the survey data
df_survey = pd.DataFrame(survey_data)

# Displaying the DataFrame
print("Survey Data DataFrame:")
print(df_survey)

Survey Data DataFrame:
   ID      Name  Age  Smartphone_Rating  Laptop_Rating  Smartwatch_Rating
0   1    Laxman   25                  9              8                  7
1   2    Rajesh   30                  8              7                  6
2   3     Padma   35                  7              9                  8
3   4  Harshita   22                  9              6                  7
4   5     Naina   28                  8              8                  9


#### Considerations or Peculiarities:

- **Column Consistency:** All lists or arrays passed as columns must have the same length; otherwise Pandas raises a `ValueError`. When you build a DataFrame from a list of row dictionaries instead, a key missing from a row produces `NaN` in that column.

- **Data Types:** Be mindful of the data types within lists or arrays. Pandas will attempt to infer data types, but it's helpful to explicitly specify them if needed.

- **Indexing:** Decide whether you need to set a specific column as the index. In the example above, the DataFrame keeps the default integer index and 'ID' remains an ordinary column; you can promote it (or any other column) to the index with `set_index`, or leave the default in place.
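
The data type and indexing points can be sketched on a shortened version of the survey data. The choice of `float64` for the rating column and of 'ID' as the index are assumptions made for illustration only.

```python
import pandas as pd

survey_data = {
    'ID': [1, 2, 3],
    'Name': ['Laxman', 'Rajesh', 'Padma'],
    'Smartphone_Rating': [9, 8, 7],
}

df = pd.DataFrame(survey_data)
print(list(df.index))  # [0, 1, 2]: the default integer index

# Set a data type explicitly instead of relying on inference.
df = df.astype({'Smartphone_Rating': 'float64'})

# set_index returns a new DataFrame whose row labels come from the 'ID' column.
df_by_id = df.set_index('ID')
print(df_by_id.loc[2, 'Name'])  # Rajesh
```

Note that `set_index` does not modify `df` in place, so the result has to be assigned to a variable.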

#### Common Mistakes:

- **Mismatched Lengths:** If the lists or arrays used as columns do not all have the same length, Pandas raises a `ValueError`. Check the lengths before creating the DataFrame.

- **Misspelled Column Names:** When creating a DataFrame from a dictionary, the keys become the column names. A misspelled key silently creates a column under the wrong name, and later lookups by the intended name raise a `KeyError`.

- **Incorrect Data Types:** If your data types are not appropriate, it can lead to unexpected results. Check that numeric columns are treated as numbers, and categorical columns are specified as such.
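
Two of these mistakes are easy to reproduce. The sketch below first triggers the length error on purpose, then shows numbers that were entered as text and one possible way to convert them with `pd.to_numeric`. All names and values are made up for the demonstration.

```python
import pandas as pd

# Mistake 1: columns of different lengths (3 names, 2 ages).
error_message = None
try:
    pd.DataFrame({'Name': ['Laxman', 'Rajesh', 'Padma'], 'Age': [25, 30]})
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Mistake 2: numbers stored as text are not treated as numeric.
df = pd.DataFrame({'Age': ['25', '30', '22']})
print(pd.api.types.is_numeric_dtype(df['Age']))  # False

# Converting the column makes arithmetic behave as expected.
df['Age'] = pd.to_numeric(df['Age'])
print(df['Age'].sum())  # 77
```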

#### `Reading Data into a DataFrame from Various Formats`

**Introduction:**
Pandas provides versatile functions to read data from different file formats, making it a powerful tool for handling diverse data sources. In this section, we will explore how to read data into a DataFrame from common formats such as CSV and Excel, and discuss additional formats supported by Pandas.

**Reading from CSV:**

1. **Using `read_csv` Function:**
   - Reading data from a CSV file is straightforward using the `read_csv` function:

In [10]:
import pandas as pd

df_csv = pd.read_csv('data.csv')

print(df_csv)

     Name  Age          City
0  Laxman   25          Pune
1  Rajesh   30     Hyderabad
2     Ram   22  Mahabubnagar


2. **Customizing Parameters:**
   - You can customize parameters such as delimiter, encoding, and header during reading:

In [11]:
df_custom_csv = pd.read_csv('data.csv', delimiter=';', encoding='utf-8', header=0)
print(df_custom_csv)

         Name,Age,City
0       Laxman,25,Pune
1  Rajesh,30,Hyderabad
2  Ram,22,Mahabubnagar


#### Explanation:


1. `pd.read_csv('data.csv', delimiter=';', encoding='utf-8', header=0)`:
   - `pd.read_csv`: This function is a part of the Pandas library and is used to read data from a CSV file into a DataFrame.
   - `'data.csv'`: Specifies the name of the CSV file to be read. Make sure the file exists in the specified location.
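   - `delimiter=';'`: Specifies the character that separates fields. It is set to a semicolon (`;`) here. Because `data.csv` is actually comma-separated, Pandas finds no semicolons and loads each row into a single column named `Name,Age,City`, as the output above shows. Always use the delimiter that matches the file.
   - `encoding='utf-8'`: Specifies the character encoding of the file. 'utf-8' is a widely used encoding format for text files.
   - `header=0`: Indicates that the first row of the CSV file contains column names. This row is used as the header of the DataFrame; it is also the default behaviour.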

2. `print(df_custom_csv)`: This line prints the DataFrame (`df_custom_csv`) to the console. The DataFrame is a tabular data structure that organizes data into rows and columns.
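
The effect of the `delimiter` parameter can be checked without a file on disk. In the sketch below, `io.StringIO` stands in for `data.csv` (an assumption made so the snippet is self-contained), and the same comma-separated text is parsed with a wrong and a correct delimiter.

```python
import io
import pandas as pd

# In-memory stand-in for a comma-separated data.csv.
csv_text = (
    "Name,Age,City\n"
    "Laxman,25,Pune\n"
    "Rajesh,30,Hyderabad\n"
    "Ram,22,Mahabubnagar\n"
)

# Wrong delimiter: no semicolons are found, so each line becomes one field.
df_wrong = pd.read_csv(io.StringIO(csv_text), delimiter=';')
print(df_wrong.shape)  # (3, 1)

# Matching delimiter: the three fields are split into three columns.
df_right = pd.read_csv(io.StringIO(csv_text), delimiter=',')
print(df_right.shape)  # (3, 3)
```

Checking `df.shape` or `df.columns` right after loading is a quick way to catch a delimiter mismatch.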


**Reading from Excel:**

1. **Using `read_excel` Function:**
   - Reading data from an Excel file is accomplished with the `read_excel` function:

In [12]:
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

print(df_excel)

       Name  Age          City
0  Harshita    8          Pune
1     Naina    1          Pune
2       Sai    5  Mahabubnagar
3   Aanchal    3        Ongole
4    Laxman   36      Srinagar
5     Padma   29          Pune
6     Ganga   32  Mahabubnagar
7    Rajesh   30          Pune


2. **Specifying Columns:**
   - You can specify columns to read from Excel:

In [13]:
df_excel_columns = pd.read_excel('data.xlsx', sheet_name='Sheet1', usecols=['Name', 'Age'])

print(df_excel_columns)

       Name  Age
0  Harshita    8
1     Naina    1
2       Sai    5
3   Aanchal    3
4    Laxman   36
5     Padma   29
6     Ganga   32
7    Rajesh   30


**Reading from Other Formats:**

1. **JSON:**
   - Pandas supports reading data from JSON files:
     ```python
     df_json = pd.read_json('data.json')
     ```

2. **HTML (Web Scraping):**
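   - Reading tables from HTML pages (web scraping) is possible using `read_html`, which returns a list of DataFrames, one per table found, and needs an HTML parser such as `lxml` to be installed: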
     ```python
     url = 'https://example.com/table'
     df_html = pd.read_html(url)[0]  # [0] selects the first table from the page
     ```

3. **SQL Databases:**
   - Reading data from SQL databases using `read_sql`:
     ```python
     from sqlalchemy import create_engine

     engine = create_engine('sqlite:///example.db')
     query = 'SELECT * FROM my_table'
     df_sql = pd.read_sql(query, engine)
     ```
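
The JSON and SQL readers can be tried without external files. The following is a minimal sketch under two assumptions: the JSON text is supplied through `io.StringIO` instead of `data.json`, and an in-memory SQLite database from the standard library's `sqlite3` module replaces the SQLAlchemy engine. The table contents are made up.

```python
import io
import sqlite3
import pandas as pd

# JSON: a list of records, one object per row.
json_text = '[{"Name": "Harshita", "City": "Pune"}, {"Name": "Sai", "City": "Hyderabad"}]'
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)

# SQL: create a small table in an in-memory SQLite database, then query it.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE my_table (Name TEXT, Age INTEGER)')
conn.executemany(
    'INSERT INTO my_table VALUES (?, ?)',
    [('Laxman', 25), ('Rajesh', 30)],
)
df_sql = pd.read_sql('SELECT * FROM my_table', conn)
conn.close()
print(df_sql)
```

For databases other than SQLite, a SQLAlchemy engine as in the snippet above is the usual way to connect.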

**Flexibility in Handling Diverse Data Sources:**

1. **URLs and HTTP(S):**
   - Reading data directly from URLs:
     ```python
     url = 'https://example.com/data.csv'
     df_url = pd.read_csv(url)
     ```

2. **ZIP Archives:**
   - Reading a data file from within a ZIP archive (the archive must contain exactly one data file):
     ```python
     df_zip = pd.read_csv('archive.zip', compression='zip', header=0)
     ```

3. **Reading from Clipboard:**
   - Copying data to the clipboard and reading it directly into a DataFrame:
     ```python
     df_clipboard = pd.read_clipboard()
     ```
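
To see the ZIP case end to end, the sketch below writes a small DataFrame into a ZIP archive in a temporary directory and reads it back. The file names `archive.zip` and `data.csv` and the sample rows are illustrative assumptions.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Name': ['Laxman', 'Rajesh'], 'Age': [25, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, 'archive.zip')

    # Write a single CSV file called data.csv inside the archive.
    df.to_csv(
        zip_path,
        index=False,
        compression={'method': 'zip', 'archive_name': 'data.csv'},
    )

    # Read it back; this works because the archive holds exactly one file.
    df_zip = pd.read_csv(zip_path, compression='zip')

print(df_zip.equals(df))  # True
```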

**Conclusion:**
Pandas' flexibility in reading data from various formats, including CSV, Excel, JSON, HTML, SQL databases, and more, makes it a versatile tool for handling diverse data sources. The ability to read directly from URLs, ZIP archives, and the clipboard enhances its capabilities for real-world data scenarios.

**Note on array shapes:** for a NumPy array whose shape is `(1, 2, 3)`:
- `1`: the number of 2D arrays it contains
- `2, 3`: the shape of each 2D array (2 rows, 3 columns)