<h1>Tasks Performed</h1>
The project examined several datasets provided by JCPenney, with the primary goal of extracting meaningful insights and understanding the properties of the data. The datasets examined were:

- jcpenney_reviewers.json: This dataset concentrated on information pertaining to the reviewers.
- jcpenney_products.json: It contained exhaustive details on various products.
- products.csv: This was an aggregation of data related to products.
- reviews.csv: It compiled reviews for an array of products.
- users.csv: This dataset was a repository of user information.

A methodical approach was used in this data analysis report to extract insights from JCPenney's datasets, notably `users.csv`, `reviews.csv`, and `jcpenney_reviewers.json`. The analysis followed a defined methodology: data loading, preliminary investigation, data visualization, and data type classification.
<h2>Data exploration</h2>

**Data exploration and Initial Investigation:**
The first step for each dataset was to specify the file path and load the data into a pandas DataFrame. This step prepared the ground for all further examination.

**Visualization for Initial Assessment:**
Using pandas' `head()` and `tail()` functions, the structure and content of each dataset were inspected right after loading. Looking at both the first and the last rows shows how the data is presented at each end of the file, which helps reveal loading problems that affect only one end. The output was styled with border colors, font sizes, and background colors to improve readability.

**Dimensionality:**
The dimensions of each dataset were reported as the number of rows and columns, giving a general idea of its size. The dataset details were then printed with `info()`, including the data type and non-null count of every column, which indicates the composition and completeness of the data.

**Descriptive Statistics:**
The distribution of the dataset was analyzed using descriptive statistics, which produced a quantitative summary that included features like central tendency, dispersion, and distribution shape. When it came to spotting trends and any abnormalities in the numerical data, these figures were essential.

**Data Completeness Evaluation:**
A comprehensive search for null values was conducted. Finding the missing data is essential for the data cleaning steps that follow and for guaranteeing the analysis's robustness.

**Uniqueness and Categorization:**
The number of unique values in each column was counted. This step showed how varied the data were and helped distinguish continuous from categorical variables. A try-except block handled unhashable types (such as lists or dictionaries) gracefully, so the loop could continue past columns that pandas cannot count directly.
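
The role of the try-except block can be illustrated with a minimal sketch. The DataFrame below is hypothetical toy data (the `Tags` column is not part of the JCPenney files); it only shows why counting unique values can fail for a column that holds lists.

```python
import pandas as pd

# Hypothetical toy data: one ordinary column and one column holding lists.
df = pd.DataFrame({
    "State": ["Texas", "Ohio", "Texas"],
    "Tags": [["a"], ["b"], ["a"]],
})

unique_counts = {}
for col in df.columns:
    try:
        unique_counts[col] = df[col].nunique()
    except TypeError:
        # Lists and dicts are unhashable, so pandas cannot count them directly.
        unique_counts[col] = "Unhashable type"

print(unique_counts)
```

If a count is still needed for such a column, one option is to convert the values to strings first (for example with `astype(str)`), at the cost of treating the column as plain text.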


**Column/Variable Analysis:**
Based on its data type, each column was classified as either categorical (dtype `object` or `category`) or non-categorical. This classification laid the foundation for analytical methods tailored to the type of data in each column.
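
As a rough sketch of this dtype-based split, the example below uses a small hypothetical DataFrame (the `Score` and `Tier` columns are illustrative assumptions). Note that the split is a heuristic: an identifier column stored as text is reported as categorical even though it may not be meaningful as a category.

```python
import pandas as pd

# Hypothetical toy data covering text, numeric, datetime, and category dtypes.
df = pd.DataFrame({
    "Username": pd.Series(["u1", "u2", "u3"], dtype="object"),
    "Score": [5, 3, 4],
    "DOB": pd.to_datetime(["1990-01-01", "1985-06-15", "2000-12-31"]),
    "Tier": pd.Categorical(["gold", "silver", "gold"]),
})

# Columns stored as 'object' or 'category' are treated as categorical.
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
# Everything else (numbers, datetimes, ...) is treated as non-categorical.
non_categorical_cols = df.select_dtypes(exclude=["object", "category"]).columns.tolist()

print("Categorical:", categorical_cols)
print("Non-categorical:", non_categorical_cols)
```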

<h2>Data visualization</h2>
The visualization of the users.csv dataset focused on the unique values within the data, the geographical distribution of users across states, and the distribution of users in the top states. First, the date-related column was converted to a datetime format to facilitate temporal analysis. A bar chart of the number of unique values per column gave insight into the variety and potential categorization of the variables. A second bar chart showed the distribution of users by state, shedding light on user demographics and potential market segments. Finally, a pie chart highlighted the top 10 states by user proportion.
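
The preparation behind these charts can be sketched on toy data. The rows and the date format below are assumptions made for illustration only. The sketch also shows a detail worth keeping in mind when reading the pie chart: percentages computed over the top states only differ from percentages computed over all users.

```python
import pandas as pd

# Hypothetical toy data; the real users.csv has Username, DOB, and State columns.
users = pd.DataFrame({
    "Username": ["a", "b", "c", "d", "e", "f"],
    "DOB": ["1990-01-31", "not a date", "1975-07-04",
            "2001-11-20", "1988-03-15", "1969-12-01"],
    "State": ["Texas", "Ohio", "Texas", "Utah", "Texas", "Ohio"],
})

# errors='coerce' turns unparseable dates into NaT instead of raising.
# The format string is an assumption about how the dates are written.
users["DOB"] = pd.to_datetime(users["DOB"], format="%Y-%m-%d", errors="coerce")
n_invalid = int(users["DOB"].isna().sum())

# Users per state, most frequent first; keep the top 2 for this small example.
top_states = users["State"].value_counts().head(2)

# Share among all users versus share within the plotted top states.
share_of_all = top_states / len(users) * 100
share_within_top = top_states / top_states.sum() * 100

print(n_invalid)
print(share_of_all.round(1))
print(share_within_top.round(1))
```

A pie chart drawn from `top_states` shows the second kind of percentage, because the slices are normalized by the sum of the plotted values.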


<h2>Data validation</h2>
The goal of the data validation process for the products.csv dataset was to find and handle missing data. The dataset was first loaded, and a thorough check for null values was made to determine whether the data was complete. This initial examination revealed missing values in several columns. In a cleaning phase, all rows containing null values were then removed. A final validation check confirmed that no missing values remained, leaving the dataset better suited for precise analysis.
The quality of the dataset was improved by this meticulous approach to data validation, opening the door for more reliable and significant insights in future analysis.
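
The check-clean-recheck cycle described above can be sketched as follows. The rows are hypothetical, and the `Price` column is an assumption for illustration. The sketch also records how many rows were dropped, since removing every row with a null can discard a noticeable part of the data.

```python
import pandas as pd

# Hypothetical toy data with two incomplete rows.
products = pd.DataFrame({
    "Uniq_id": ["p1", "p2", "p3", "p4"],
    "Name": ["Shirt", None, "Shoes", "Hat"],
    "Price": [19.99, 5.00, None, 12.50],
})

# Initial check: null count per column.
nulls_before = products.isnull().sum()

# Drop every row that contains at least one null value.
cleaned = products.dropna()

# Final check, plus the number of rows lost to cleaning.
nulls_after = cleaned.isnull().sum()
rows_dropped = len(products) - len(cleaned)

print(nulls_before)
print(nulls_after)
print(f"Rows dropped: {rows_dropped} of {len(products)}")
```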

<h2>Data visualization</h2>
After the initial analysis of the reviews.csv dataset, several visualizations were made to explore the characteristics of product reviews. The process started with a count plot of the review scores, which shows the frequency of each score category and gives an indication of trends in customer satisfaction. An attribute called "Review Length" was then added to the data, holding the number of characters in each review. Using this attribute, a histogram was plotted along with a kernel density estimate to display the distribution of review lengths and identify patterns in how customers express feedback. Next, the ten products with the most reviews were determined. To improve clarity, this involved tallying reviews per product ID and matching the IDs with product names. The resulting bar chart gave a clear visual picture of these products and highlighted where customers were most engaged. A similar strategy was used to highlight the top 10 users by number of reviews submitted. These users were displayed in a bar chart, which highlights the most engaged members of the review community. Taken together, these visualizations summarized the review data through quantitative metrics such as score frequencies, review lengths, and review counts.
This comprehensive visualization played a pivotal role in understanding consumer behavior, product performance, and overall user engagement within the dataset.
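
A minimal sketch of the two derived quantities used here, review length and review counts labeled with product names, is given below. The data is hypothetical; only the column names `Uniq_id`, `Review`, and `Name` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for reviews.csv and products.csv.
reviews = pd.DataFrame({
    "Uniq_id": ["p1", "p1", "p2", "p1", "p3", "p2"],
    "Review": ["Great", "Too small", "Fine", "Love it", "Bad", "OK"],
})
products = pd.DataFrame({
    "Uniq_id": ["p1", "p2", "p3"],
    "Name": ["Shirt", "Shoes", "Hat"],
})

# Review length in characters.
reviews["Review Length"] = reviews["Review"].str.len()

# Number of reviews per product ID, keeping the top 2 for this small example.
top_counts = reviews["Uniq_id"].value_counts().head(2)

# Relabel the counts with product names; the counts themselves stay unchanged.
name_by_id = products.drop_duplicates("Uniq_id").set_index("Uniq_id")["Name"]
top_counts_named = top_counts.copy()
top_counts_named.index = top_counts.index.map(name_by_id)

print(reviews["Review Length"].tolist())
print(top_counts_named)
```

Keeping the counts as the values and changing only the index labels ensures that a bar chart of `top_counts_named` shows the number of reviews, not the number of times a name occurs.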


<h2>Data analysis</h2>
The analysis evaluated customer reviews from the jcpenney_products.json dataset using NLTK's VADER sentiment intensity analyzer. The data was loaded and processed into a structured DataFrame containing the review texts and their unique identifiers. A custom function used VADER to categorize each review as positive, negative, or neutral based on its sentiment score. The number of reviews in each sentiment category was then tallied. The result was a bar chart showing the distribution of sentiments among the reviews, annotated with the number of reviews in each category, which gave a concise sentiment summary of the dataset's customer reviews. The labeled reviews were then written to a new JSON file called "reviews_with_tone.json." To confirm that the process succeeded, the first ten entries of the new file were loaded into a pandas DataFrame and styled for readability, demonstrating that the sentiment results had been integrated into the original review data.
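
The categorization step can be sketched without running VADER itself. The function name `label_from_compound` and the cutoff of 0.05 are assumptions: 0.05 is the threshold conventionally suggested for VADER's compound score, and the notebook's own function may use different values. In the notebook, the compound score would come from `sia.polarity_scores(text)["compound"]`; here a few hypothetical scores are used instead.

```python
from collections import Counter


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The threshold is an assumed, conventional cutoff, not a value from the notebook.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores standing in for real VADER output.
scores = [0.82, -0.41, 0.0, 0.03, 0.64]
labels = [label_from_compound(score) for score in scores]

# Tally the number of reviews in each sentiment category.
sentiment_counts = Counter(labels)
print(sentiment_counts)
```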






In [None]:
# IMPORT ALL NEEDED LIBRARIES
import pandas as pd
import seaborn as sns  # Importing the seaborn library for data visualization
import json  # Importing the json library for working with JSON data
import matplotlib.pyplot as plt  # Importing matplotlib's pyplot module for plotting
from sklearn.feature_extraction.text import CountVectorizer  # Importing the CountVectorizer class from sklearn for text analysis
import nltk  # Importing the nltk library for natural language processing tasks
from nltk.corpus import stopwords  # Importing the stopwords corpus from nltk for text analysis
from nltk.probability import FreqDist  # Importing the FreqDist class from nltk for frequency distribution analysis
from nltk.sentiment import SentimentIntensityAnalyzer  # Importing the SentimentIntensityAnalyzer class from nltk for sentiment analysis
from nltk.tokenize import word_tokenize  # Importing the word_tokenize function from nltk for tokenizing text
import warnings  # Importing the warnings module for handling warnings
warnings.filterwarnings('ignore')  # Ignoring warning messages during execution

<a id="1"></a>
## <b>1.0 <span style='color:#B21010'></span>Data exploration</b>


In [None]:

# Specifying the path to the users.csv file
dataPath = 'users.csv'

# Reading the users.csv file into a dataframe
dfUsers = pd.read_csv(dataPath)

# Printing the first few rows of the users.csv file using the head() function
print("First Few Rows of users.csv using head():")

# Styling the first 10 rows of the dfUsers dataframe for better visualization
styledHeadDf = dfUsers.head(10).style.set_properties(
    **{
        'border': '1.3px solid white',
        'color': 'white',
        'background-color': 'black',
        'font-size': '10px',
        'width':'20%',
    }
)
styledHeadDf

In [None]:
# Using pandas' 'tail()' function, we show the dataset's final 10 rows for a cursory analysis.
# This gives a brief overview of the end of the dataset, including column names and values.
print("Last 10 rows of users.csv using tail():")

# Styling the last 10 rows of the dfUsers dataframe for better visualization
styledTailDf = dfUsers.tail(10).style.set_properties(
    **{
        'border': '1.3px solid black',
        'color': 'white',
        'background-color': 'grey',
        'font-size': '10px',
        'width':'20%',
    }
)

# Outputting the styled dataframe
styledTailDf

In [None]:
# Printing the number of rows and columns in the users.csv dataset
print(f"\nThe users.csv dataset has {dfUsers.shape[0]} rows and {dfUsers.shape[1]} columns.\n")

In [None]:
# Printing dataset information
print("Dataset Info:\n")
# Printing the information about the dfUsers dataframe using the info() function
print(dfUsers.info())

In [None]:
# Printing descriptive statistics for the users.csv dataset
print("\nDescriptive Statistics for users.csv:")

# Calculating the descriptive statistics for the dfUsers dataframe using the describe() function
statsDescriptionForUsers = dfUsers.describe().style.set_properties(
    **{
        'border': '1.3px solid white',
        'color': 'white',
        'background-color': 'grey',
        'font-size': '10px',
        'width':'20%',
    }
)

# Outputting the styled descriptive statistics dataframe
statsDescriptionForUsers

In [None]:
# This code block counts and displays the number of missing (null) values in each column of the dataset.
# The 'isnull()' function in pandas flags nulls in the DataFrame, and 'sum()' then aggregates them column-wise.
# This check is essential during preprocessing for finding and properly handling missing data.
print("Null Values in each column:")
print(dfUsers.isnull().sum())

In [None]:
# Display the number of unique values in each column, handling unhashable types.
# The loop below iterates over each column of the dataset and uses pandas'
# 'nunique()' to count unique values. A try-except block handles unhashable types like lists or dictionaries,
# outputting a custom message in such cases. This helps in assessing data variability and identifying potential categorical columns.

print("Number of unique values in each column:")
for col in dfUsers.columns:
    try:
        print(f"{col}: {dfUsers[col].nunique()}")
    except TypeError:
        print(f"{col}: Unhashable type")


In [None]:
# Identify categorical and non-categorical columns.
# The code below splits the dataset's columns into categorical and non-categorical data types.
# Using pandas' 'select_dtypes' function, it first attempts to locate categorical columns (those with data types of 'object' or 'category').
# If a TypeError is raised, a list comprehension identifies 'object' type columns as categorical instead.
# Next, the 'object' and 'category' data types are excluded in order to identify the non-categorical columns.
# The block ends by printing both lists, giving a clear picture of the dataset's structure for later analysis steps.
try:
    categorical_cols = dfUsers.select_dtypes(include=['object', 'category']).columns.tolist()
except TypeError:
    categorical_cols = [col for col in dfUsers.columns if dfUsers[col].dtype == 'object']
non_categorical_cols = dfUsers.select_dtypes(exclude=['object', 'category']).columns.tolist()

print("Categorical Columns:")
print(categorical_cols)
print("\nNon-Categorical Columns:")
print(non_categorical_cols)


In [None]:
# reviews.csv
dataPath = 'reviews.csv'
dfReviews = pd.read_csv(dataPath)
# Display the first 10 rows of the dataset for preliminary examination using pandas'
# 'head()' method. This gives a brief summary of the dataset's structure, including column names and initial values.
# Verifying that the data loaded correctly and understanding the dataset layout are crucial first steps in data analysis.
print("First 10 rows of reviews.csv using head():")
styledHeadDfReviews = dfReviews.head(10).style.set_properties(
    **{
        # Only one 'border' entry: a repeated dict key would silently override the first one
        'border': '1.3px solid white',
        'color': 'white',
        'background-color': 'black',
        'font-size': '10px',
    }
)
# Output the head()
styledHeadDfReviews

In [None]:
# Display the last 10 rows of the dataset for preliminary examination using pandas'
# 'tail()' method. This gives a brief summary of the end of the dataset, including column names and values.
# Verifying that the data loaded correctly and understanding the dataset layout are crucial first steps in data analysis.
print("Last 10 rows of reviews.csv using tail():")
styledTailDfReviews = dfReviews.tail(10).style.set_properties(
    **{
        'border': '1.3px solid black',
        'color': 'white',
        'background-color': 'grey',
        'font-size': '10px',
    }
)
styledTailDfReviews

In [None]:
# Display the number of rows and columns in the dataset using pandas' 'shape' attribute,
# which returns a tuple (rows, columns) reflecting the DataFrame's dimensions.
print(f"\nThe reviews.csv dataset has {dfReviews.shape[0]} rows and {dfReviews.shape[1]} columns.\n")

In [None]:
#The code block below provides comprehensive dataset details,
# such as column data types and non-null counts using the info method from pandas
print("Dataset Info:\n")
print(dfReviews.info())

In [None]:
# This code block produces descriptive statistics for the dataset,
# including max, min, and standard deviation, using the 'describe()' function.
# The resulting DataFrame is styled for readability: border style, text color, font size, background color, and column width.

print("\nDescriptive Statistics for reviews.csv:")
statsDescriptionForReviews = dfReviews.describe().style.set_properties(
    **{
        'border': '1.3px solid white',
        'color': 'white',
        'background-color': '#900C3F',
        'font-size': '10px',
        'width':'20%',
    }
)
# Output the styled descriptive statistics dataframe
statsDescriptionForReviews

In [None]:
# This code block counts and displays the number of missing (null) values in each column of the dataset.
# The 'isnull()' function in pandas flags nulls in the DataFrame, and 'sum()' then aggregates them column-wise.
print("Null Values in each column:")
print(dfReviews.isnull().sum())


In [None]:
# Display the number of unique values in each column, handling unhashable types.
# The loop below iterates over each column of the dataset and uses pandas'
# 'nunique()' to count unique values. A try-except block handles unhashable types like lists or dictionaries,
# outputting a custom message in such cases. This helps in assessing data variability and identifying potential categorical columns.
print("Number of unique values in each column:")
for col in dfReviews.columns:
    try:
        print(f"{col}: {dfReviews[col].nunique()}")
    except TypeError:
        print(f"{col}: Unhashable type")

In [None]:
# Identify categorical and non-categorical columns.
# The code below splits the dataset's columns into categorical and non-categorical data types.
# Using pandas' 'select_dtypes' function, it first attempts to locate categorical columns (those with data types of 'object' or 'category').
# If a TypeError is raised, a list comprehension identifies 'object' type columns as categorical instead.
# Next, the 'object' and 'category' data types are excluded in order to identify the non-categorical columns.
# The block ends by printing both lists, giving a clear picture of the dataset's structure for later analysis steps.
try:
    categorical_cols = dfReviews.select_dtypes(include=['object', 'category']).columns.tolist()
except TypeError:
    categorical_cols = [col for col in dfReviews.columns if dfReviews[col].dtype == 'object']
non_categorical_cols = dfReviews.select_dtypes(exclude=['object', 'category']).columns.tolist()

print("Categorical Columns:")
print(categorical_cols)
print("\nNon-Categorical Columns:")
print(non_categorical_cols)

In [None]:
# Load jcpenney_reviewers.json and display its first 10 rows for preliminary examination using pandas' 'head()' method.
# This gives a brief summary of the dataset's structure, including column names and values.
# Verifying that the data loaded correctly and understanding the dataset layout are crucial first steps in data analysis.
# The output of the dataframe is styled via set_properties.
dataPath = 'jcpenney_reviewers.json'
# lines=True reads the file as JSON Lines (one JSON object per line)
dfReviewers = pd.read_json(dataPath, lines=True)
print("First 10  rows of jcpenney_reviewers.json using head():")
dfReviewersStyled = dfReviewers.head(10).style.set_properties(
    **{
        'border': '1.3px solid white',
        'color': 'white',
        'background-color': '#AA4A44',
        'font-size': '10px',
        'width':'10%',
    }
)
# Output
dfReviewersStyled

In [None]:
# Displaying the last 10 rows of the dataset for preliminary examination using pandas' 'tail()' method. 
# The structure of the dataset, including column names and  values, is briefly summarized using this technique.
#  Verifying data loading and comprehending dataset layout are crucial aspects of data analysis.
# The output of the dataframe is also styled via  set_properties
print("Last 10  rows of jcpenney_reviewers.json using tail():")
dfReviewersStyledTail = dfReviewers.tail(10).style.set_properties(
    **{
        'border': '1.3px solid white',
        'color': 'black',
        'background-color': '#DAF7A6',
        'font-size': '10px',
        'width':'10%',
    }
)
# output
dfReviewersStyledTail

In [None]:
# Display the number of rows and columns in the dataset using pandas' 'shape' attribute,
# which returns a tuple (rows, columns) reflecting the DataFrame's dimensions.
print(f"\nThe dataset has {dfReviewers.shape[0]} rows and {dfReviewers.shape[1]} columns.\n")

In [None]:
# The code block below provides comprehensive dataset details,
# such as column data types and non-null counts using the info method from pandas
print("JcPenney Reviewers(json) Info:\n")
print(dfReviewers.info())

In [None]:
# This code block produces descriptive statistics for the dataset,
# including max, min, and standard deviation, using the 'describe()' function.
# The resulting DataFrame is styled for readability: border style, text color, font size, background color, and column width.
print("\nDescriptive Statistics for jcpenney_reviewers.json:")
productJsonReviews = dfReviewers.describe().style.set_properties(
    **{
        'border': '1.3px solid white',
        'color': 'white',
        'background-color': '#FFC300',
        'font-size': '10px',
        'width':'10%',
    }
)
# Output the styled descriptive statistics dataframe
productJsonReviews

In [None]:
# This code block counts and displays the number of missing (null) values in each column of the dataset.
# The 'isnull()' function in pandas flags nulls in the DataFrame, and 'sum()' then aggregates them column-wise.
# This check is essential during preprocessing for finding and properly handling missing data.
print("Null Values in each column:")
print(dfReviewers.isnull().sum())

In [None]:
# Display the number of unique values in each column, handling unhashable types.
# The loop below iterates over each column of the dataset and uses pandas'
# 'nunique()' to count unique values. A try-except block handles unhashable types like lists or dictionaries,
# outputting a custom message in such cases. This helps in assessing data variability and identifying potential categorical columns.
print("Number of unique values in each column:")
for col in dfReviewers.columns:
    try:
        print(f"{col}: {dfReviewers[col].nunique()}")
    except TypeError:
        print(f"{col}: Unhashable type")

In [None]:
# Identify categorical and non-categorical columns.
# The code below splits the dataset's columns into categorical and non-categorical data types.
# Using pandas' 'select_dtypes' function, it first attempts to locate categorical columns (those with data types of 'object' or 'category').
# If a TypeError is raised, a list comprehension identifies 'object' type columns as categorical instead.
# Next, the 'object' and 'category' data types are excluded in order to identify the non-categorical columns.
# The block ends by printing both lists, giving a clear picture of the dataset's structure for later analysis steps.
try:
    categorical_cols = dfReviewers.select_dtypes(include=['object', 'category']).columns.tolist()
except TypeError:
    categorical_cols = [col for col in dfReviewers.columns if dfReviewers[col].dtype == 'object']
non_categorical_cols = dfReviewers.select_dtypes(exclude=['object', 'category']).columns.tolist()

print("Categorical Columns:")
print(categorical_cols)
print("\nNon-Categorical Columns:")
print(non_categorical_cols)


<a id="2.0"></a>
### <b>2.0 <span style='color:#B21010'></span> Data Visualization</b>

In [None]:
# Visualize the categorical and non-categorical variables of users.csv.
# The EDA above showed that users.csv has no null values, so the data can be visualized as is.
# Null values in each column of the Users dataset:
# Username    0
# DOB         0
# State       0
# dtype: int64
# The dfUsers DataFrame loaded earlier is reused; only the DOB column is converted to datetime format.


# Convert the 'DOB' column in the dfUsers DataFrame to datetime format. 
# Set errors='coerce' to replace any invalid dates with NaT (Not a Time) value.
dfUsers['DOB'] = pd.to_datetime(dfUsers['DOB'], errors='coerce')


# Function to visualize data in the given DataFrame.
# Takes two parameters: df - the DataFrame, ds_name - the name of the dataset.
def visualize_data(df, ds_name):

# The purpose of this function is to visualize the given dataset (here users.csv).
# It prints a title for the visualization and creates a figure of a given size.
# It then counts the distinct values in each column and displays these counts as a bar graph.
# After the graph has been given a title and axis labels, it is displayed.

# Print title for the data visualization
    print(f"Data Visualization for {ds_name} Dataset\n{'-'*30}")
# Create a figure with a specific size
    plt.figure(figsize=(15, 7))  
# Calculate the number of unique values per column
    n_unique = df.nunique()
# Plot a bar graph to visualize the number of unique values per column
    n_unique.plot(kind='bar')
# Set the title and labels for the graph
    plt.title("Number of Unique Values per Column")
    plt.ylabel("Number of Unique Values")
    plt.xticks(rotation=45)
# Display the graph
    plt.show()
       

# The code below displays the distribution of users across states.
# It generates a bar graph with the states on the x-axis and the number of users on the y-axis.
# A title and axis labels are applied to the graph. To avoid overlapping, the x-axis labels are rotated by 90 degrees using the rotation=90 parameter.
# plt.tight_layout() adjusts the layout so that labels do not overlap.
# The graph is finally shown.

# Create a figure 15 inches wide and 8 inches tall
    plt.figure(figsize=(15, 8))
# Count the number of occurrences of each state in the 'State' column and plot a bar graph
    df['State'].value_counts().plot(kind='bar', color='teal')
# Set the title and labels for the graph
    plt.title('State Distribution of Users')
    plt.xlabel('State')
    plt.ylabel('Number of Users')
    plt.xticks(rotation=90)
# Adjust the layout to prevent overlapping of labels and display the graph
    plt.tight_layout()
    plt.show()



 
# This code block examines the top 10 states by number of users.
# It generates a pie chart in which each slice shows a state's share of the users in those top 10 states
# (the percentages are relative to the plotted top 10, not to all users in the dataset).
# Using the autopct='%1.1f%%' argument, the percentage value is shown on each slice.
# The y-axis label is removed and the graph is given a title. The pie chart is finally displayed.

# Create a large figure for the pie chart
    plt.figure(figsize=(20, 12))
# Count the number of occurrences of each state in the 'State' column and select the top 10 states
    df['State'].value_counts().head(10).plot(kind='pie', autopct='%1.1f%%')
# Set the title and remove the y-axis label
    plt.title('Top 10 States Proportion of Users')
    plt.ylabel('')
# Display the pie chart
    plt.show()

# Call the visualization function with the user data
visualize_data(dfUsers, 'Users')

<a id="3.0"></a>
<b>3.0 <span style='color:#B21010'></span> Data Validation</b>

In [None]:
# The code block below removes rows with null values from the products data.
# The dataset is loaded from the 'products.csv' file into the df_products DataFrame.
# An initial check with isnull().sum() counts the null values in each column.
# dropna() then removes every row that contains a null value, producing a new DataFrame called df_products_cleaned.
# After cleaning, a final check counts the remaining null values.
# The results of the first and last checks are printed.

# Load the products dataset
df_products = pd.read_csv('products.csv')

# Initial check for null values before cleaning
null_values_before = df_products.isnull().sum()
print("Null values before cleaning:")
print(null_values_before)
# Remove rows with null values
df_products_cleaned = df_products.dropna()
# Final check for null values after cleaning
null_values_after = df_products_cleaned.isnull().sum()
print("\nNull values after cleaning:")
print(null_values_after)

<a id="4.0"></a>
<b>4.0 <span style='color:#B21010'>||</span> Data Visualization for reviews.csv</b>

In [None]:


df_reviews = pd.read_csv('reviews.csv')

# Visualization for Score Distribution
# This code uses a countplot to illustrate the distribution of review scores. 
# The figure is made using the countplot function from the sns module, and the x-axis variable is the 'Score' column from the df_reviews DataFrame. 
# A title, axis labels, and a figure size adjustment are applied to the plot to improve visibility. 
# The frequency of each score category is displayed in the resulting graphic.
plt.figure(figsize=(22, 12))
sns.countplot(x='Score', data=df_reviews, palette='viridis')
plt.title('Distribution of Review Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()

# Find the review length
df_reviews['Review Length'] = df_reviews['Review'].str.len()

# Visualization for Review Length Distribution
# This code uses a histogram to display the distribution of review lengths. 
# The 'Review Length' column from the df_reviews DataFrame is used as the data in the plot, which is made using the histplot function from the sns module.
# To regulate the granularity of the histogram, the number of bins is set to thirty. A title, axis labels, and a figure size adjustment are applied to the plot to improve visibility.
# The frequency of review lengths within each bin is displayed in the resulting figure along with an overlaying kernel density estimate (kde) curve.

plt.figure(figsize=(22, 12))
sns.histplot(df_reviews['Review Length'], bins=30, color='orange', kde=True)
plt.title('Distribution of Review Lengths')
plt.xlabel('Review Length')
plt.ylabel('Frequency')
plt.show()



# Top 10 Products with the Most Reviews
# A count of each unique 'Uniq_id' in the df_reviews DataFrame gives the number of reviews per product,
# and head(10) keeps the ten most-reviewed products.
# The product IDs are then mapped to product names using the df_products DataFrame.
# Finally, the review counts are shown as a bar graph labeled with the product names. To improve readability,
# the x-axis labels are rotated, and the graph is given a title and axis names.
top_product_counts = df_reviews['Uniq_id'].value_counts().head(10)
# Build an ID -> name lookup (duplicate IDs are dropped so the mapping is unambiguous)
name_by_id = df_products.drop_duplicates('Uniq_id').set_index('Uniq_id')['Name']
# Relabel the counts with product names while keeping the review counts as values
top_product_counts.index = top_product_counts.index.map(name_by_id)
# Plot the review counts with product names on the x-axis
plt.figure(figsize=(22, 12))
top_product_counts.plot(kind='bar', color='blue')
plt.title('Top 10 Products with the Most Reviews')
plt.xlabel('Product Name')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=45)
plt.show()


# Top 10 Users with the Most Reviews
# The top 10 users with the most reviews are displayed using this code. 
# The 'Username' column of the df_reviews DataFrame is where the value_counts() method is used to count the occurrences of each unique username.
#  Next, using kind='bar,' the counts are shown as a bar graph. 
# In order to improve readability, the x-axis labels are rotated, and the graph is given a title and axis names.
top_users = df_reviews['Username'].value_counts().head(10)
plt.figure(figsize=(22, 12))
top_users.plot(kind='bar', color='green')
plt.title('Top 10 Users with the Most Reviews')
plt.xlabel('Username')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=45)
plt.show()
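
The top-products chart depends on attaching product names to per-product review counts without losing the counts. The sketch below tries that step on toy data. The frames `toy_reviews` and `toy_products` and their contents are made up for illustration and only assume the same `Uniq_id` and `Name` column names as the real DataFrames.

```python
import pandas as pd

# Hypothetical toy data standing in for df_reviews and df_products
toy_reviews = pd.DataFrame({'Uniq_id': ['a', 'a', 'a', 'b', 'b', 'c']})
toy_products = pd.DataFrame({
    'Uniq_id': ['a', 'b', 'c'],
    'Name': ['Shirt', 'Jeans', 'Socks'],
})

# Count reviews per product ID, most-reviewed first, and keep the top 2
review_counts = toy_reviews['Uniq_id'].value_counts().head(2)

# Build an ID -> name lookup and relabel the index; the counts are untouched
name_by_id = (
    toy_products.drop_duplicates('Uniq_id')
    .set_index('Uniq_id')['Name']
    .to_dict()
)
named_counts = review_counts.rename(index=name_by_id)
print(named_counts)
```

Relabeling the index keeps each bar's height equal to the review count, whereas calling `value_counts()` on the names alone would only count how often each name appears in the top list.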

<a id="5.0"></a>
<b>5.0 <span style='color:#B21010'> </span> Data analysis</b>

In [None]:
# This code uses NLTK's VADER sentiment intensity analyzer to analyze the sentiment of customer reviews in jcpenney_products.json.
# The analyzer is initialized with NLTK's SentimentIntensityAnalyzer() class.
# The 'jcpenney_products.json' file is read line by line into a list of dictionaries inside a with open() block.
# A list comprehension flattens the data into a list of (uniq_id, review_text) tuples.
# The flattened data is placed into a DataFrame called df_reviews, with columns 'uniq_id' and 'review_text'.
# A function called analyze_sentiment is defined to classify the text of a single review.
# It computes the VADER compound score and labels the review 'positive' (score >= 0.05), 'negative' (score <= -0.05), or 'neutral' (anything in between).
# The apply() method runs analyze_sentiment on the 'review_text' column of the df_reviews DataFrame.
# The resulting label is stored in a new column called 'sentiment'.
# Finally, value_counts() on the 'sentiment' column tallies the number of reviews in each sentiment category,
# and the result is stored in the sentiment_counts variable.


# Initialize NLTK's VADER sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()

# Load the data from 'jcpenney_products.json', which holds one JSON object per line
with open('jcpenney_products.json', 'r') as file:
    data = [json.loads(line) for line in file]

# Flatten the data to get a list of tuples (uniq_id, review_text)
reviews_data = [(item['uniq_id'], review['Review']) for item in data for review in item['Reviews']]

# Create a DataFrame from the reviews
df_reviews = pd.DataFrame(reviews_data, columns=['uniq_id', 'review_text'])

# Define a function to apply sentiment analysis
def analyze_sentiment(review_text):
    score = sia.polarity_scores(review_text)
    if score['compound'] >= 0.05:
        return 'positive'
    elif score['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Apply the function to get sentiment for each review
df_reviews['sentiment'] = df_reviews['review_text'].apply(analyze_sentiment)

# Count the number of each sentiment
sentiment_counts = df_reviews['sentiment'].value_counts()



# This code creates a bar chart to show the reviews' sentiment distribution.
# sentiment_counts is a pandas Series sorted by frequency, so it is first reindexed to a fixed
# category order. This guarantees that each color is paired with the intended sentiment.
# Next, a title, axis labels, and the x-axis label rotation are set to customize the plot.
# Lastly, each bar is annotated with the number of reviews in that sentiment group.
# The show function is used to display the generated plot.
# Put the categories in a fixed order so the colors below line up with them
sentiment_order = ['positive', 'neutral', 'negative']
sentiment_counts = sentiment_counts.reindex(sentiment_order, fill_value=0)

# Plot the sentiment distribution
plt.figure(figsize=(22, 12))  # Create a new figure with a specific size
ax = sentiment_counts.plot(kind='bar', color=['green', 'gray', 'red'])  # Bar plot from the sentiment_counts Series
plt.title('Sentiment Analysis of Reviews')  # Set the title of the plot
plt.xlabel('Sentiment')  # Set the label for the x-axis
plt.ylabel('Number of Reviews')  # Set the label for the y-axis
plt.xticks(rotation=0)  # Keep the x-axis labels horizontal

# Adding annotations
for p in ax.patches:  # Iterate over each bar patch in the plot
    # Write the bar's height (the review count) centered just above the bar
    ax.annotate(str(int(p.get_height())), (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom')

# Display the graph
plt.show()
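
The cut-offs used in `analyze_sentiment` (a compound score of at least 0.05 is positive, at most -0.05 is negative, anything in between is neutral) can be checked on their own, without loading VADER. The following is a minimal sketch of that idea. The helper name `label_from_compound` and its `threshold` parameter are hypothetical and do not appear in the notebook, and the scores are hand-picked values rather than real analyzer output.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Hand-picked scores around the cut-offs (not real VADER output)
for score in (0.05, 0.049, 0.0, -0.049, -0.05):
    print(score, label_from_compound(score))
```

Keeping the threshold logic in a separate function makes the boundary behavior easy to test and lets the cut-off be changed in one place.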

In [None]:
# The code block below classifies the sentiment of each review and saves the results to a new file for further examination.
# It starts by setting up NLTK's VADER sentiment intensity analyzer.
# Next, the JSON data in the 'jcpenney_products.json' file is loaded.
# The function get_sentiment is then defined; it accepts a text parameter and returns the sentiment classification determined by the VADER analysis.
# The code then iterates over every item in the dataset and every review in the item's 'Reviews' list.
# For each review, get_sentiment is called and the result is stored under a 'Tone' key next to the review text in a new list of records.
# The list of records is saved to a new JSON file named 'reviews_with_tone.json'.
# Finally, the new file is read back in order to view the newly added 'Tone' field.

# Initialize NLTK's VADER sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()

# Load the JSON data
with open('jcpenney_products.json', 'r') as file:
    data = [json.loads(line) for line in file.readlines()]

# Function to classify sentiment
def get_sentiment(text):
    score = sia.polarity_scores(text)
    if score['compound'] >= 0.05:
        return 'positive'
    elif score['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Extracting reviews and their sentiments
review_data = []
for item in data:
    for review in item['Reviews']:
        review_tone = get_sentiment(review['Review'])
        review_data.append({'Review': review['Review'], 'Tone': review_tone})

# Save the review data to a new JSON file
with open('reviews_with_tone.json', 'w') as outfile:
    json.dump(review_data, outfile, indent=4)

# Read the new file back into a DataFrame
try:
    # Attempt to read the file as a standard JSON array of objects
    newFile = pd.read_json('reviews_with_tone.json')
except ValueError:
    # If there is a ValueError, attempt to read it as line-delimited JSON
    newFile = pd.read_json('reviews_with_tone.json', lines=True)
# Get the first 10 rows of the file and apply styling
outputOfNewModifiedFile = newFile.head(10).style.set_properties(
    **{
        'border': '1.3px solid white',
        'color': 'white',
        'background-color': 'black',
        'font-size': '10px',
    }
)
# output
outputOfNewModifiedFile
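
Because `json.dump` is given a list of dictionaries, the saved file is a single JSON array of objects, which is the layout the first `pd.read_json` call expects; the `lines=True` fallback should therefore rarely be needed. The sketch below checks this round trip using only the standard library. The sample records and the file name `reviews_with_tone_sample.json` are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records in the same shape as review_data
sample_reviews = [
    {'Review': 'Great fit and fast delivery.', 'Tone': 'positive'},
    {'Review': 'The zipper broke after a week.', 'Tone': 'negative'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'reviews_with_tone_sample.json')
    # Write the records the same way the notebook does
    with open(path, 'w') as outfile:
        json.dump(sample_reviews, outfile, indent=4)
    # Read the raw text back and parse it as one JSON document
    with open(path, 'r') as infile:
        text = infile.read()
    loaded = json.loads(text)

# The file starts with '[' (a JSON array) and round-trips without loss
starts_as_array = text.lstrip().startswith('[')
print(starts_as_array, loaded == sample_reviews)
```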

# REFERENCES
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html#pandas.DataFrame.isnull
- https://www.analyticsvidhya.com/blog/2021/06/top-15-pandas-data-exploration-functions/
- https://canvas.stir.ac.uk/courses/13894/pages/functions-with-arguments?module_item_id=738358