# __Working with Text Data in Pandas__

## __Agenda__

In this lesson, we will cover the following concepts with the help of examples:

- Text Data in Pandas
- Iteration
  * Iterating over Rows
  * Applying a Function to Each Element
  * Vectorized Operations
  * Iterating over Series
- Sorting
  * Sorting DataFrame by Column
  * Sorting DataFrame by Multiple Columns
  * Sorting DataFrame by Index
  * Sorting a Series
- Plotting with Pandas

## __1. Text Data in Pandas__

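Pandas exposes vectorized string methods through the `.str` accessor of a Series, which lets you manipulate and analyze text one column at a time. The cells below show some common operations: measuring string length, converting to lowercase, and checking for a substring.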

In [None]:
# Build a sample DataFrame 'df' whose 'Column' holds text data
import pandas as pd

df = pd.DataFrame({'Column': ['Hello', 'World', 'Python', 'Data Science']})

# Calculates the length of each string
df['Length'] = df['Column'].str.len()
print("Length of each string:")
print(df[['Column', 'Length']])


In [None]:
df

In [None]:
df['new_col'] = [1,2,3,4]

In [None]:
df

In [None]:
df["Lowercase"] = df['Column'].str.lower()
df

In [None]:
# Rebuild the sample DataFrame 'df' whose 'Column' holds text data
df = pd.DataFrame({'Column': ['Hello', 'World', 'Python', 'Data Science']})

# Converts text to lowercase
df['Lowercase'] = df['Column'].str.lower()
print("\nText in lowercase:")
print(df[['Column', 'Lowercase']])


In [None]:
df

In [None]:
df["Lowercase"].str.contains("python")

In [None]:
df['result'] = df["Lowercase"].str.contains("python")
df

In [None]:
# Rebuild the sample DataFrame 'df' whose 'Column' holds text data
df = pd.DataFrame({'Column': ['Hello', 'World', 'Python', 'Data Science']})

# Checks if each string contains the specified substring
substring = 'Data'
df['ContainsSubstring'] = df['Column'].str.contains(substring)
print("\nContains substring 'Data':")
print(df[['Column', 'ContainsSubstring']])
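
`str.contains` is case-sensitive by default, and missing entries produce a missing result rather than `False`. The following is a minimal sketch of the `case` and `na` parameters, assuming a hypothetical Series named `texts` that is not part of the lesson data:

```python
import pandas as pd

# Hypothetical sample data (not from the lesson), including a missing value
texts = pd.Series(['Hello', 'World', 'Python', 'Data Science', None])

# Case-sensitive by default: 'data' does not match 'Data Science'
case_sensitive = texts.str.contains('data', na=False)

# case=False ignores case; na=False treats missing entries as "no match"
case_insensitive = texts.str.contains('data', case=False, na=False)

print(case_sensitive.tolist())    # [False, False, False, False, False]
print(case_insensitive.tolist())  # [False, False, False, True, False]
```

Passing `na=False` keeps the result a plain boolean column, which is convenient when the result is used to filter rows.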


## __2. Iteration__

Iteration in Pandas means traversing the rows or elements of a DataFrame or Series.
- Looping over DataFrame rows with a Python `for` loop is generally discouraged for performance reasons, because every row is handled by interpreted Python code.
- Pandas instead offers vectorized operations and methods such as `apply` for working with DataFrame elements.
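
To make the contrast concrete, here is a small sketch that computes the same result with a row loop and with a vectorized expression. The DataFrame `df_demo` and its column names are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical sample data (not from the lesson)
df_demo = pd.DataFrame({'Price': [10.0, 20.0, 30.0], 'Quantity': [1, 2, 3]})

# Row-by-row loop: works, but runs Python code for every row
totals_loop = []
for _, row in df_demo.iterrows():
    totals_loop.append(row['Price'] * row['Quantity'])

# Vectorized equivalent: one expression over whole columns
totals_vectorized = df_demo['Price'] * df_demo['Quantity']

print(totals_loop)
print(totals_vectorized.tolist())
```

Both approaches give the same totals; the vectorized form is shorter and usually much faster on large DataFrames.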

![image.png](attachment:6a6a1846-b952-488f-aaa4-f52af904864c.png)

### __2.1 Iterating over Rows__

In [None]:
import pandas as pd

# Assuming 'df' is your DataFrame with columns 'Column1' and 'Column2'
df = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': ['A', 'B', 'C']})

for index, row in df.iterrows():
    print(f"Index: {index}, Data: {row['Column1']}, {row['Column2']}")


In [None]:
df

In [None]:
for i in df.iterrows():
    # Each item is an (index, row) tuple
    print(i)
    print(type(i))

In [None]:
for i, j in df.iterrows():
    # i is the index label, j is the row as a Series
    print(i)
    print(type(i))
    print(j)
    print(type(j))
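
Because `iterrows()` turns every row into a Series, the pandas documentation notes that it does not preserve dtypes across rows. When a row loop is unavoidable, `itertuples()` is a commonly used alternative that yields namedtuples. A minimal sketch, assuming a DataFrame named `df_rows` that mirrors the one above:

```python
import pandas as pd

# Same shape as the lesson's example; the name df_rows is an assumption
df_rows = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': ['A', 'B', 'C']})

# Each row is a namedtuple; the index label is available as row.Index
collected = []
for row in df_rows.itertuples():
    collected.append((row.Index, row.Column1, row.Column2))
    print(f"Index: {row.Index}, Data: {row.Column1}, {row.Column2}")
```

Columns are accessed as attributes (`row.Column1`) rather than by key, so this works best when column names are valid Python identifiers.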

### __2.2 Applying a Function to Each Element__

In [None]:
df

In [None]:
list(map(lambda x : x**2 , [1,2,3,4,5]))

In [None]:
df['Column1'].apply(lambda x : x**2)

In [None]:
# Assuming 'df' is your DataFrame with 'ExistingColumn'
df = pd.DataFrame({'ExistingColumn': [10, 20, 30]})

df['NewColumn'] = df['ExistingColumn'].apply(lambda x: x * 2)
print(df)


### __2.3 Vectorized Operations__

In [None]:
# Assuming 'df' is your DataFrame with 'ColumnA' and 'ColumnB'
df = pd.DataFrame({'ColumnA': [1, 2, 3], 'ColumnB': [4, 5, 6]})

df['ResultColumn'] = df['ColumnA'] + df['ColumnB']
print(df)

### __2.4 Iterating over Series__

In [None]:
# Assuming 'series' is your Pandas Series
series = pd.Series([10, 20, 30], name='Values')

for index, value in series.items():
    print(f"Index: {index}, Value: {value}")

In [None]:
series.items()

## __3. Sorting__
Sorting in Pandas involves arranging the elements of a DataFrame or Series based on specific criteria, such as column values or indices. 

![image.png](attachment:8e942634-2dc9-459e-b9f9-726ca9073afa.png)

### __3.1 Sorting DataFrame by Column__

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 22, 30],
                   'Salary': [50000, 60000, 75000]})

# Sort DataFrame by the 'Age' column in ascending order
df_sorted = df.sort_values(by='Age')
print("Sorted DataFrame by Age:\n", df_sorted)

### __3.2 Sorting DataFrame by Multiple Columns__

In [None]:
# Sort DataFrame by 'Age' in ascending order, then by 'Salary' in descending order
df_sorted_multi = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print("\nSorted DataFrame by Age and Salary:\n", df_sorted_multi)
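
In the sample DataFrame every 'Age' value is unique, so the second sort key never changes the result. The sketch below uses a hypothetical DataFrame, `df_ties`, with two rows sharing the same age to show where the 'Salary' ordering takes effect:

```python
import pandas as pd

# Hypothetical data (not from the lesson) with two people aged 25
df_ties = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
                        'Age': [25, 22, 30, 25],
                        'Salary': [50000, 60000, 75000, 65000]})

# Age ascending first; within equal ages, Salary descending
sorted_ties = df_ties.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(sorted_ties)
```

Dana appears before Alice because both are 25 and Dana has the higher salary.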


### __3.3 Sorting DataFrame by Index__

In [None]:
# Sort DataFrame by index in descending order
df_sorted_index = df.sort_index(ascending=False)
print("\nSorted DataFrame by Index:\n", df_sorted_index)


### __3.4 Sorting a Series__

In [None]:
# Create a sample Series
series = pd.Series([25, 22, 30], index=['Alice', 'Bob', 'Charlie'], name='Age')

# Sort Series in descending order
series_sorted = series.sort_values(ascending=False)
print("\nSorted Series by Age:\n", series_sorted)


## __4. Plotting with Pandas__
Plotting helps you see trends and patterns in data. Pandas DataFrames and Series provide a `plot` method, built on Matplotlib, whose `kind` parameter selects the plot type, such as line (the default), bar, histogram, or scatter.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Plot a line chart
df.plot()
plt.show()

In [None]:
# Plot a bar chart
df.plot(kind='bar')
plt.show()


# __Assisted Practice__

## __Problem Statement:__
Create a detailed report on the monthly weather data by performing text manipulation, data sorting, and visualization to analyze temperature and precipitation trends.

__Data:__
The dataset contains daily observations of temperature and precipitation over a month.

## __Steps to Perform:__

1. Textual manipulation
- Convert 'Day' to a string format with appropriate suffixes (1st, 2nd, 3rd, and so on)
- Classify 'Temperature' into categories (Low, Medium, High) based on predefined thresholds
- Determine if 'Precipitation' falls under 'Light', 'Moderate', or 'Heavy' rainfall

2. Iteration and data aggregation
- Iterate over the DataFrame to calculate weekly averages of temperature and precipitation
- Summarize findings in a new DataFrame

3. Sorting
- Sort the DataFrame by 'Temperature' and 'Precipitation' in ascending and descending order
- Sort the DataFrame by the day of the month

4. Plotting
- Create a line plot for temperature trends
- Create a bar plot to compare precipitation levels across the month
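
As one possible starting point for step 2, the sketch below computes weekly averages. It rests on several assumptions: the DataFrame `weather`, its values, and the 'Week' column are hypothetical, a week is taken to be a block of seven consecutive days, and `groupby` is used in place of an explicit loop:

```python
import pandas as pd

# Hypothetical 14-day sample; the real exercise data is not shown in the lesson
weather = pd.DataFrame({
    'Day': range(1, 15),
    'Temperature': [20] * 7 + [24] * 7,
    'Precipitation': [1.0] * 7 + [3.0] * 7,
})

# Days 1-7 -> week 1, days 8-14 -> week 2, and so on
weather['Week'] = (weather['Day'] - 1) // 7 + 1

# Average both measurements per week and collect them in a new DataFrame
weekly_summary = (
    weather.groupby('Week')[['Temperature', 'Precipitation']]
    .mean()
    .reset_index()
)
print(weekly_summary)
```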

In [None]:
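def add_suffix(day):
    # 11th, 12th, and 13th are exceptions to the st/nd/rd pattern
    if 10 <= day % 100 <= 20:
        suffix = 'th'
    else:
        # Pick the suffix from the last digit; default to 'th'
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(day % 10, 'th')
    return str(day) + suffix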

In [None]:
# Assumes 'df' holds the weather data with a 'day' column
df['day'].apply(add_suffix)

In [None]:
def categories(temp):
    pass  # condition
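
The exercise does not state the temperature thresholds, so the following is only a sketch of how the classification could look. The cut-off values `LOW_MAX` and `MEDIUM_MAX`, the function name `categorize_temperature`, and the sample values are all hypothetical:

```python
import pandas as pd

# Hypothetical thresholds; the exercise leaves the real ones unspecified
LOW_MAX = 15
MEDIUM_MAX = 25

def categorize_temperature(temp):
    """Return 'Low', 'Medium', or 'High' for a single temperature value."""
    if temp < LOW_MAX:
        return 'Low'
    elif temp <= MEDIUM_MAX:
        return 'Medium'
    return 'High'

# Hypothetical sample temperatures
temps = pd.Series([10, 15, 25, 31])
categories_result = temps.apply(categorize_temperature)
print(categories_result.tolist())  # ['Low', 'Medium', 'Medium', 'High']
```

The same pattern can be reused for the 'Light', 'Moderate', and 'Heavy' precipitation labels once their thresholds are chosen.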