# **Project Name**    - TED Talk Views Prediction (Regression)



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member -**   Nikita Saxena


# **Project Summary -**


The objective of this project is to develop a predictive model that can accurately forecast the views of videos uploaded on the TEDx website. TED, a nonprofit organization founded in 1984, is renowned for its conferences and talks that cover a wide range of topics and attract speakers from diverse fields. With over 4,000 TED talks available, including transcripts in multiple languages, the dataset provides a rich resource for analysis.

The predictive model aims to leverage the available data, which includes variables such as talk ID, title, speaker information, occupations, recorded date, published date, event details, languages, comments, duration, topics, related talks, URL, description, and the transcript itself. By analyzing these variables, the model will be trained to predict the number of views a video is likely to receive.

Accurate view predictions can be valuable for various stakeholders, including TEDx organizers, speakers, and viewers. For TEDx organizers, the model can assist in assessing the potential popularity of a talk and optimizing event planning and marketing strategies. Speakers can benefit from understanding the expected reach of their presentations, enabling them to refine their content and delivery accordingly. Additionally, viewers can benefit from improved recommendations based on predicted view counts, enhancing their TEDx experience.

The predictive model will be developed using machine learning techniques, drawing on a variety of algorithms suitable for regression tasks. The dataset's features will be analyzed and preprocessed to handle any missing values, categorical variables, or text data in the transcripts. Feature engineering techniques will be employed to extract relevant information from the available features. The dataset will be divided into training and testing sets, allowing the model's performance to be evaluated accurately.

To build an effective predictive model, various algorithms such as linear regression, decision trees, random forests, or gradient boosting will be explored and compared. The models will be trained on the training set and evaluated using appropriate evaluation metrics to assess their predictive performance. Cross-validation techniques will be employed to ensure the robustness and generalizability of the chosen model.
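
As a rough illustration of the comparison described above, the sketch below scores two regressors with 5-fold cross-validation. The data is synthetic and purely hypothetical (it is not the TED dataset), and the choice of R² as the metric is an assumption made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Candidate models to compare.
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Shuffled 5-fold split, fixed seed so the comparison is repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Using the same `KFold` object for every model keeps the folds identical, so score differences reflect the models rather than the split.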

Once the predictive model is developed and validated, it can be deployed as a tool to predict the views of newly uploaded TEDx videos. Users will input the relevant information about their video, and the model will generate an estimate of the expected view count. This information can be utilized for strategic decision-making, content optimization, and overall video performance evaluation.

By successfully predicting video views, the model will contribute to enhancing the TEDx experience for organizers, speakers, and viewers. It will empower organizers to make informed decisions, help speakers tailor their presentations for maximum impact, and enable viewers to discover talks aligned with their interests. Ultimately, the project aims to leverage data-driven insights to amplify the reach and impact of TEDx talks, furthering the mission of spreading powerful ideas to a wider audience.


# **GitHub Link -**

https://github.com/shadow9411111/tedtalk

# **Problem Statement**


The objective of this project is to perform an efficient exploratory data analysis (EDA) on the TED talks dataset, followed by feature selection, encoding, new feature creation, handling multicollinearity (if present), feature scaling, and understanding the target feature and its distribution. Additionally, the project aims to develop predictive models using at least two algorithms, evaluate and improve the models, analyze feature importance, and draw conclusions. The ultimate goal is to demonstrate how this project can provide useful insights to stakeholders.

Specific tasks to address in the project include:

1. Efficient EDA: Conduct a comprehensive exploratory data analysis of the TED talks dataset. This involves analyzing the distributions, summary statistics, and relationships among variables. Identify any missing values, outliers, or data quality issues that may impact subsequent analysis.

2. Encoding and Feature Selection: If necessary, perform encoding on categorical variables to convert them into numerical form suitable for modeling. Apply feature selection techniques to identify the most relevant features that have a significant impact on the target variable.

3. New Feature Creation: Explore the dataset to identify opportunities for creating new features that can enhance the predictive power of the models. These new features can be derived from existing variables through transformations, aggregations, or combinations.

4. Multicollinearity Handling: Detect and address multicollinearity, a situation where predictor variables are highly correlated with each other. This can distort the model's performance and interpretation. Implement techniques such as correlation analysis or variance inflation factor (VIF) to identify and mitigate multicollinearity.

5. Feature Scaling: Apply appropriate feature scaling techniques to ensure that variables with different scales and units do not disproportionately influence the models' performance. Common scaling methods include standardization (mean centering and variance scaling) or normalization (scaling to a specific range).

6. Understanding the Target Feature and Distribution: Analyze the target feature (in this case, the number of views) to gain insights into its distribution and any patterns or trends. Identify any skewness, outliers, or other characteristics that may impact the modeling process.

7. Modeling: Develop predictive models using at least two algorithms suitable for regression tasks. This could include linear regression, decision trees, random forests, or gradient boosting. Train the models using the prepared dataset and evaluate their performance using appropriate evaluation metrics.

8. Evaluation and Improvement: Assess the models' performance and identify areas for improvement. Fine-tune the models by adjusting hyperparameters, exploring ensemble techniques, or incorporating feature engineering strategies. Utilize cross-validation techniques to ensure robustness and generalizability.

9. Feature Importance and Conclusion: Analyze the importance of different features in predicting the number of views. Identify the key variables that have the most significant impact on the target variable. Summarize the findings and draw conclusions about the relationships between features and the target variable.

10. Stakeholder Utility: Highlight the usefulness of the project for various stakeholders. Demonstrate how the developed predictive models and insights from the analysis can assist TEDx organizers in assessing talk popularity, enabling speakers to optimize their presentations, and providing viewers with enhanced recommendations for talks aligned with their interests.
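
The multicollinearity check in task 4 can be sketched as follows. This is a minimal example on made-up data: the helper `compute_vif` and the column `duration_minutes` are hypothetical names introduced here, and the VIF is computed from the inverse of the correlation matrix rather than with a dedicated library routine.

```python
import numpy as np
import pandas as pd


def compute_vif(features: pd.DataFrame) -> pd.Series:
    """Return the variance inflation factor of each numeric column.

    Uses the identity VIF_j = [R^-1]_jj, where R is the correlation
    matrix of the features (equivalent to 1 / (1 - R_j^2)).
    """
    corr = features.corr().to_numpy()
    vif_values = np.diag(np.linalg.inv(corr))
    return pd.Series(vif_values, index=features.columns, name="vif")


# Hypothetical demo data: two nearly redundant columns and one independent one.
rng = np.random.default_rng(42)
duration = rng.normal(700, 200, size=500)
demo = pd.DataFrame({
    "duration": duration,
    "duration_minutes": duration / 60 + rng.normal(0, 0.1, size=500),
    "num_topics": rng.integers(1, 10, size=500).astype(float),
})

vif = compute_vif(demo)
print(vif.round(2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others; dropping or combining such features is one way to mitigate the problem.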


# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using an Evaluation Metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import datetime


These cells import pandas for data manipulation and analysis, seaborn and matplotlib.pyplot for static plots and charts, Plotly (`graph_objects` and `make_subplots`) for interactive figures, and the standard-library `datetime` module.

### Dataset Loading

In [None]:
# Load Dataset
talk = pd.read_csv("/content/data_ted_talks (1).csv")

In the provided code snippet, the dataset containing TED talk data is loaded into a pandas DataFrame called "talk" from a CSV file located at the specified file path "/content/data_ted_talks (1).csv".

### Dataset First View

In [None]:
# Dataset First Look
print(talk)

In [None]:
talk.head()

In [None]:
talk.tail()

The cells `print(talk)`, `talk.head()`, and `talk.tail()` provide a first look at the dataset: `print(talk)` prints the DataFrame (pandas abbreviates large frames, showing only the first and last rows), `talk.head()` shows the first five rows, and `talk.tail()` shows the last five.

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

num_rows = talk.shape[0]
print("NUMBER OF ROWS :- ", num_rows)
num_columns = talk.shape[1]
print("NUMBER OF COLUMN :- ", num_columns)

The code calculates the number of rows and columns in the dataset `talk`. It uses the `.shape` attribute to get the dimensions of the dataset and assigns the number of rows to `num_rows` and the number of columns to `num_columns`. Finally, it prints the values.

# Word Cloud of Descriptions

In [None]:
from wordcloud import WordCloud

# Concatenate all descriptions into a single string
description_text = ' '.join(talk['description'].dropna())

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(description_text)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Descriptions')
plt.show()


The code generates a word cloud visualization using the `WordCloud` library. It combines all the descriptions from the TED talks dataset into a single string and creates a word cloud representation where the size of each word is proportional to its frequency in the text.
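
To make the "size is proportional to frequency" idea concrete, here is a small sketch that counts word frequencies directly. The helper `word_frequencies` and the sample sentences are hypothetical; note also that `WordCloud` applies its own stopword filtering by default, so its ranking will not match a raw count exactly.

```python
import re
from collections import Counter


def word_frequencies(texts, top_n=5):
    """Count lowercase word occurrences across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters and apostrophes; drop punctuation and digits.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


# Hypothetical sample descriptions, not taken from the dataset.
sample_descriptions = [
    "Ideas worth spreading about science and design.",
    "A talk about science, education and ideas.",
]

top_words = word_frequencies(sample_descriptions, top_n=3)
print(top_words)
```

On the real data the same helper could be called as `word_frequencies(talk['description'].dropna())`.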

# Word Cloud of Transcripts

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Uses the 'transcript' column of the 'talk' DataFrame, which contains the spoken text

# Concatenate all transcripts into a single string
transcript_text = ' '.join(talk['transcript'].dropna())

# Create the word cloud object with desired configurations
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(transcript_text)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Transcripts')
plt.show()


This code generates a word cloud visualization from a DataFrame column called 'transcript'. It concatenates all the transcript texts, creates a WordCloud object with specified configurations, and then plots the word cloud using matplotlib. The resulting visualization shows the most frequent words in the transcripts.

In [None]:
import matplotlib.pyplot as plt

# Uses the 'views' column of the 'talk' DataFrame, which contains the number of views

# Select the top 10 talks with the highest number of views
top_10_talks = talk.nlargest(10, 'views')

# Set the figure size
plt.figure(figsize=(10, 6))

# Create a bar plot for the top 10 talks
plt.barh(top_10_talks['title'], top_10_talks['views'], color='purple')

# Customize the plot
plt.xlabel('Number of Views')
plt.ylabel('Talk Title')
plt.title('Top 10 Talks with Highest Views')

# Invert the y-axis for better readability
plt.gca().invert_yaxis()

# Display the plot
plt.show()


In this code, we select the top 10 talks with the highest number of views using the nlargest() function. We then create a horizontal bar plot using plt.barh() to visualize the number of views for each talk. The plot is customized with labels for the x-axis and y-axis, a title, and the y-axis is inverted for better readability. Finally, the plot is displayed using plt.show().

In [None]:
# Take the last 10 rows of the 'talk' DataFrame
# (these are the most recent talks only if the rows are ordered by date)
latest_data = talk.tail(10)

# Create a Plotly figure with subplots
fig = make_subplots(rows=1, cols=3, subplot_titles=("Views", "Duration", "Comments"))

# Add scatter plots to each subplot
fig.add_trace(go.Scatter(x=latest_data.index, y=latest_data['views'], mode='lines+markers', name='Views'), row=1, col=1)
fig.add_trace(go.Scatter(x=latest_data.index, y=latest_data['duration'], mode='lines+markers', name='Duration'), row=1, col=2)
fig.add_trace(go.Scatter(x=latest_data.index, y=latest_data['comments'], mode='lines+markers', name='Comments'), row=1, col=3)

# Set the layout
fig.update_layout(height=500, width=1000, showlegend=False)

# Show the updated plot
fig.show()


The code takes the last 10 rows of the 'talk' DataFrame and creates a Plotly figure with three side-by-side subplots. Each subplot draws one column ('views', 'duration', or 'comments') against the row index as a line with markers. The layout sets the figure size and hides the legend, and `fig.show()` displays the figure.

### Dataset Information

In [None]:
# Dataset Info
talk.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = talk.duplicated().sum()

print("Number of duplicate values:", duplicate_count)

The code snippet counts the fully duplicated rows in `talk` (rows that match an earlier row in every column), assigns the count to `duplicate_count`, and prints it.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_count = talk.isnull().sum()

print("Missing values count:")
print(missing_count)

The code snippet calculates and prints the count of missing values or null values in the "talk" DataFrame. It uses the `isnull().sum()` function to count the number of missing values in each column and then displays the count of missing values.
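
Raw counts are easier to judge next to the share of rows they represent. The sketch below, on a small hypothetical frame, reports both and sorts the columns by how much is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'talk' DataFrame.
demo = pd.DataFrame({
    "views": [1200, 560, 98000, 4300],
    "comments": [10.0, np.nan, 250.0, np.nan],
    "occupations": ["author", None, "designer", "physicist"],
})

# isnull().mean() is the fraction of missing rows per column.
missing_summary = pd.DataFrame({
    "missing_count": demo.isnull().sum(),
    "missing_pct": demo.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Replacing `demo` with `talk` gives the same table for the real dataset.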

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
missing_count.plot(kind='bar', color='skyblue')
plt.title('Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Count')
plt.show()

### What did you know about your dataset?

The dataset contains information about TED talks, including columns such as 'talk_id', 'views', 'comments', and 'duration', and consists of 4005 rows. The 'views' column has a mean of approximately 2.1 million, with a minimum value of 0 and a maximum value of around 65 million. The 'comments' column has missing values: its non-null count is 3350, so 655 of the 4005 rows lack a comment count. The 'duration' column ranges from 60 to 3922; these values are presumably recorded in seconds rather than minutes, since 3922 minutes would be more than 65 hours. The mean of 161.997015 noted for 'duration' should be verified against the `talk.describe()` output, because it sits unusually close to the minimum for a range that wide.

## ***2. Understanding Your Variables***

In [None]:
# Get the column names
column_names = talk.columns

# Print the column names
for column in column_names:
    print(column)

In [None]:
# Dataset Describe
talk.describe()


### Variables Description

Answer
The dataset contains information about TED talks. Here is a brief description of each variable:

- talk_id: The unique identifier for each talk.
- title: The title of the talk.
- speaker_1: The main speaker for the talk.
- all_speakers: All speakers involved in the talk, including any additional speakers besides the main one.
- occupations: The occupations or professions of the speakers.
- about_speakers: Information or description about the speakers.
- views: The number of views the talk has received.
- recorded_date: The date when the talk was recorded.
- published_date: The date when the talk was published.
- event: The event or conference where the talk was given.
- native_lang: The native language of the talk.
- available_lang: The available languages for subtitles or translations.
- comments: The number of comments received on the talk.
- duration: The duration or length of the talk.
- topics: The topics or subject categories related to the talk.
- related_talks: Related talks or recommended talks.
- url: The URL or link to the talk.
- description: A description or summary of the talk.
- transcript: The transcript or text of the talk.



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in talk.columns:
    unique_values = talk[column].unique()
    print(f"Unique values for {column}:")
    print(unique_values)
    print()

In [None]:
#check the unique value
talk.nunique()


In [None]:
talk.describe(include='object').T

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the TED talks dataset
talk = pd.read_csv('/content/data_ted_talks (1).csv')

# Drop unnecessary columns if any
talk = talk.drop(['related_talks'], axis=1)

# Convert date columns to datetime format
talk['recorded_date'] = pd.to_datetime(talk['recorded_date'])
talk['published_date'] = pd.to_datetime(talk['published_date'])

# Handle missing values
talk = talk.dropna()  # Drop rows with missing values

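# One-hot encode the event column
# (these dummy columns are not among the features selected below)
talk = pd.get_dummies(talk, columns=['event'])

# Standardize the numeric columns to zero mean and unit variance.
# This includes the target 'views', so the charts below plot standardized values.
scaler = StandardScaler()
talk[['views', 'comments', 'duration']] = scaler.fit_transform(talk[['views', 'comments', 'duration']])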

# Feature selection or dropping irrelevant columns
selected_features = ['title', 'speaker_1', 'occupations', 'views', 'duration', 'topics']
talk = talk[selected_features]

# Create new features if needed
talk['num_topics'] = talk['topics'].apply(lambda x: len(x.split(',')))

talk.head()

### What all manipulations have you done and insights you found?

Answer

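I dropped the 'related_talks' column, converted the date columns to datetime, removed rows with missing values, one-hot encoded 'event', and standardized 'views', 'comments', and 'duration'. I then kept six columns ('title', 'speaker_1', 'occupations', 'views', 'duration', 'topics') and added 'num_topics', the number of comma-separated entries in 'topics'. Insights include the varying views and durations of the talks, as well as the diversity of topics covered by the speakers.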
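
One alternative to scaling the whole table at once is to split first, fit the scaler on the training rows only, and leave the target in its original units. The sketch below shows that pattern on hypothetical random data; it is an illustration of the idea, not the pipeline used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with two features and the target.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "duration": rng.integers(60, 3922, size=100),
    "num_topics": rng.integers(1, 12, size=100),
    "views": rng.integers(1_000, 5_000_000, size=100),
})

X = demo[["duration", "num_topics"]]
y = demo["views"]  # target is left unscaled

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting on the training split alone keeps information about the test rows out of the preprocessing step, and an unscaled target keeps predictions and plots readable as view counts.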

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

# Bar Plot: Top 10 Talks by Views


In [None]:
top_10_views = talk.nlargest(10, 'views')
plt.figure(figsize=(12, 6))
plt.bar(top_10_views['title'], top_10_views['views'])
plt.xlabel('Talk Title')
plt.ylabel('Views')
plt.title('Top 10 Talks by Views')
plt.xticks(rotation=45, ha='right')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a bar chart showing the top 10 talks by views. This chart is selected because it provides a visual representation of the popularity of these talks based on the number of views they have received.

##### 2. What is/are the insight(s) found from the chart?

Answer

The insights gained from the chart are:

- It highlights the talks that have garnered the highest number of views.
- It shows the relative popularity of these talks compared to others in the dataset.
- It allows for easy identification of the most viewed talks.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can potentially create a positive business impact by:

- Identifying popular topics or speakers that attract a large audience.
- Guiding content creators or event organizers to focus on subjects that resonate well with viewers.

There may not be any insights from this specific chart that directly lead to negative growth. However, it's important to note that the chart alone may not provide a comprehensive understanding of the talks' impact or the factors contributing to their views. Additional analysis and contextual information are necessary to draw more specific conclusions.


#### Chart - 2

# Pie Chart: Distribution of Speakers' Occupations

In [None]:
# Chart - 2 visualization code
occupation_counts = talk['occupations'].value_counts().head(5)
plt.figure(figsize=(8, 8))
plt.pie(occupation_counts, labels=occupation_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Speakers\' Occupations')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a pie chart to visualize the distribution of speakers' occupations.

##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

- The chart shows the five most frequent values of 'occupations' among TED talk speakers.
- The percentages are each occupation's share among these five values, not among all speakers.
- The pie chart helps to understand the relative dominance of certain occupations in TED talks.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can help create a positive business impact in the following ways:

1- Understanding the distribution of occupations can provide insights into the diversity and expertise of TED talk speakers.

2- It can help identify popular fields or industries that TED talks tend to focus on.

3- This information can be valuable for event organizers, sponsors, and advertisers looking to target specific audiences or industries.

There are no insights from the chart that would directly lead to negative growth. The distribution of speakers' occupations is a descriptive analysis and does not inherently indicate any negative impact. The insights obtained from this chart are primarily informative and can be utilized to tailor TED talk content or attract relevant audiences, thus promoting positive growth and engagement.


#### Chart - 3

# Histogram: Distribution of Views

In [None]:
# Chart - 3 visualization code
plt.hist(talk['views'], bins=20)
plt.xlabel('Views')
plt.ylabel('Frequency')
plt.title('Distribution of Views')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a histogram to visualize the distribution of views in the TED talk dataset.

##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

1- The majority of the TED talks have views between 0 and 5 million.

2- There are a few talks that have exceptionally high view counts, exceeding 20 million.

3- The distribution is right-skewed, indicating that a few talks have gained significant popularity compared to the majority.
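
A right-skewed target is often easier to model after a log transform. The sketch below uses synthetic log-normal values as a hypothetical stand-in for view counts and compares the skewness before and after applying `log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed values standing in for view counts.
rng = np.random.default_rng(7)
views = pd.Series(rng.lognormal(mean=13.0, sigma=1.0, size=2000), name="views")

# log1p(x) = log(1 + x); it is defined at 0, which matters if a talk has no views.
log_views = np.log1p(views)

raw_skew = views.skew()
log_skew = log_views.skew()
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

If a model is trained on the transformed target, `np.expm1` converts its predictions back to view counts.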


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can have a positive business impact by:

1- Identifying the most popular talks that have garnered a large number of views, which can be used to understand the factors contributing to their success.

2- Providing insights on the distribution of views, helping to set realistic expectations and target specific audience segments.

There are no insights in the given chart that directly lead to negative growth. However, it is important to analyze the content and other factors contributing to the popularity of talks with high views. If the analysis reveals any negative trends or controversial topics associated with the highly viewed talks, it may have some negative impact. But without further information, it is not possible to determine any specific negative growth impact.

#### Chart - 4

# Scatter Plot: Views vs. Duration

In [None]:
# Chart - 4 visualization code
plt.scatter(talk['duration'], talk['views'], alpha=0.5)
plt.xlabel('Duration')
plt.ylabel('Views')
plt.title('Views vs. Duration')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The scatter plot chart was chosen to visualize the relationship between the duration of TED talks and the number of views. Scatter plots are effective for examining the correlation between two continuous variables.
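
A scatter plot can be backed up with a correlation coefficient. The sketch below computes Pearson and Spearman correlations on hypothetical data in which the two columns are generated independently; Spearman is obtained here as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'views' is drawn independently of 'duration'.
rng = np.random.default_rng(3)
demo = pd.DataFrame({"duration": rng.uniform(60, 3922, size=300)})
demo["views"] = rng.lognormal(mean=13.0, sigma=1.0, size=300)

# Pearson measures linear association between the raw values.
pearson = demo["duration"].corr(demo["views"])

# Spearman measures monotonic association: Pearson applied to the ranks.
spearman = demo["duration"].rank().corr(demo["views"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Spearman is less sensitive to the few extremely popular talks, which makes it a useful companion to Pearson for a skewed variable such as views.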

##### 2. What is/are the insight(s) found from the chart?

Answer

The scatter plot shows no clear relationship between duration and views: talks of widely varying durations attract a wide range of view counts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

Positive business impact:

The insight that talks of varying durations can attract a wide range of views can help in diversifying the content strategy. It suggests that focusing solely on shorter or longer talks may not be necessary, and there is potential for success with a variety of durations.

Negative growth insights:

There are no insights from the chart that indicate negative growth. The absence of a clear relationship between duration and views does not necessarily lead to negative impacts. It simply implies that duration alone may not be the primary factor in driving views.

Overall, the gained insights can have a positive business impact by guiding content creators to focus on delivering engaging and compelling talks regardless of their duration. By understanding that the duration of a talk does not guarantee or limit its success, TED can continue to provide diverse content that resonates with its audience.






#### Chart - 5

# Box Plot: Distribution of Views by Number of Topics

In [None]:
# Chart - 5 visualization code
talk.boxplot(column='views', by='num_topics', figsize=(10, 6))
plt.xlabel('Number of Topics')
plt.ylabel('Views')
plt.title('Distribution of Views by Number of Topics')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a boxplot, which is suitable for visualizing the distribution of the 'views' variable based on the 'num_topics' variable.

##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

- The boxplots show the median, quartiles, and outliers for different numbers of topics.
- The distribution of views varies across different numbers of topics.
- It shows how viewership differs with the number of topics a talk is tagged with; this is an association, not evidence that the topic count drives views.
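
The numbers behind a grouped box plot can be tabulated with a `groupby`. The sketch below uses a tiny hypothetical frame to report the count, median, and mean of views for each value of `num_topics`.

```python
import pandas as pd

# Hypothetical stand-in for the wrangled 'talk' DataFrame.
demo = pd.DataFrame({
    "num_topics": [3, 3, 5, 5, 5, 8],
    "views": [1000, 3000, 2000, 4000, 90000, 5000],
})

# One row per topic count; the median is robust to the outlier in group 5.
summary = demo.groupby("num_topics")["views"].agg(["count", "median", "mean"])
print(summary)
```

Groups with very few talks (such as the single row for 8 topics here) should be read with caution, since their median says little about the group.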


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can help create a positive business impact:

1- TED organizers can analyze the relationship between the number of topics and views to identify trends and patterns.

2- They can check whether talks tagged with a particular number of topics tend to draw more views and allocate resources accordingly.

3- It can assist in planning future TED events and selecting speakers based on the topics that attract more viewers.

There are no insights that directly lead to negative growth. However, if the chart reveals that talks with a specific number of topics consistently have low viewership, it may indicate the need for reevaluating the selection or presentation of those topics. Adjustments can be made to improve engagement and attract a larger audience.






#### Chart - 6

# Stacked Bar Plot: Top 3 Speakers by Number of Talks

In [None]:
# Chart - 6 visualization code
top_speakers = talk['speaker_1'].value_counts().head(3)
top_speakers_by_topic = talk[talk['speaker_1'].isin(top_speakers.index)].pivot_table(index='topics', columns='speaker_1', aggfunc='size', fill_value=0)
top_speakers_by_topic.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.xlabel('Topics')
plt.ylabel('Count')
plt.title('Top 3 Speakers by Number of Talks')
plt.xticks(rotation=45)
plt.legend(title='Speaker')
plt.show()



##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a stacked bar chart to visualize the top 3 speakers by the number of talks they have given across different topics.



##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

- The chart shows the distribution of talks among the top 3 speakers for each topic.
- It allows us to compare the contribution of each speaker to different topics and identify their areas of expertise.
- We can observe which topics are most popular among these top speakers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

- Potential positive business impact:

- The insights from this chart can help event organizers or conference planners identify popular topics and speakers for future events.
- It can guide the selection of speakers based on their expertise in specific topics, ensuring a diverse and engaging lineup for the audience.

- Potential negative impact:

Negative growth is not directly indicated by this chart.
However, if one or more of the top speakers consistently dominate the majority of talks across various topics, it may limit the exposure and opportunities for other speakers, leading to a lack of diversity and fresh perspectives. In such cases, it would be advisable to consider inviting new speakers to maintain a dynamic and inclusive event environment.





#### Chart - 7

# Pie Chart: Distribution of Topics

In [None]:
import matplotlib.pyplot as plt

topic_counts = talk['topics'].value_counts().head(5)

# Define custom colors for the pie chart
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0']

plt.figure(figsize=(8, 8))
plt.pie(topic_counts, labels=topic_counts.index, autopct='%1.1f%%', colors=colors)
plt.title('Distribution of Topics')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a pie chart. Pie charts are suitable for visualizing the distribution of categorical data, such as the distribution of topics in this case. It allows us to see the proportion of each topic category relative to the whole.

##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

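- The chart shows the five most frequent values of the 'topics' column. Because each value holds a talk's full comma-separated topic list, these are the most common topic combinations rather than individual topics.
- The percentages displayed on the chart are shares among these five values only, not of the whole dataset.
- This visualization gives an overview of the topic combinations that are most prevalent among the TED talks.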
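
To count individual topics rather than whole topic lists, each list can be split and exploded into one row per topic. The sketch below assumes that 'topics' is stored as a list-like string such as `"['science', 'design']"`; that storage format is an assumption and should be checked against the actual column before reuse.

```python
import pandas as pd

# Hypothetical sample values in the assumed list-like string format.
demo = pd.DataFrame({
    "topics": [
        "['science', 'technology', 'design']",
        "['science', 'education']",
        "['design']",
    ]
})

individual_topics = (
    demo["topics"]
    .str.strip("[]")      # drop the surrounding brackets
    .str.split(",")       # one list of raw topic strings per talk
    .explode()            # one row per (talk, topic) pair
    .str.strip(" '\"")    # remove leftover spaces and quotes
)

topic_counts = individual_topics.value_counts()
print(topic_counts)
```

Plotting `topic_counts.head(5)` would then show the five most common single topics.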


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can help create a positive business impact in the following ways:

- By identifying the most popular topics, TED can focus on creating more content related to those topics, which may attract a larger audience.
- It can assist in understanding audience interests and preferences, enabling TED to curate future events and speakers accordingly.

There are no insights in the chart that directly indicate negative growth. However, if certain topics have significantly lower representation or if there is a notable absence of certain topics, it may indicate a potential gap in content diversity. Addressing such gaps and offering a more balanced range of topics could lead to overall growth and engagement.

#### Chart - 8

# Word Cloud: Most Common Topics

In [None]:
# Chart - 8 visualization code
from wordcloud import WordCloud

topics_text = ' '.join(talk['topics'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(topics_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Topics')
plt.show()


##### 1. Why did you pick the specific chart?

Answer


The specific chart chosen is a word cloud visualization. It was selected to provide insights into the most common topics discussed in the TED talks dataset.

##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

- The word cloud represents the frequency of different topics in the dataset, with larger words indicating more common topics.
- It helps identify the most prominent and frequently discussed topics in the TED talks.
- The word cloud allows for a quick visual understanding of the topics that TED speakers focus on and the areas of interest for the audience.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can have a positive business impact by:

- Helping organizers and speakers identify popular and trending topics that resonate with the audience.
- Informing content creators and marketers about potential areas of interest to target and engage their audience.
- Guiding the selection of topics for future TED talks, ensuring relevance and maximizing audience engagement.

There are no insights that inherently lead to negative growth. The insights gained from the word cloud chart are neutral and provide information about the topics that are popular and discussed frequently. The impact of these insights depends on how they are utilized by the business and the actions taken based on the identified topics.

#### Chart - 9

# Bar Plot: Top 10 Topics by Talk Count

In [None]:
# Chart - 9 visualization code
top_topics = talk['topics'].value_counts().head(10)
plt.figure(figsize=(12, 6))
plt.bar(top_topics.index, top_topics.values)
plt.xlabel('Topic')
plt.ylabel('Count')
plt.title('Top 10 Topics by Talk Count')
plt.xticks(rotation=45, ha='right')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a bar chart because it effectively displays the top 10 topics by talk count.

##### 2. What is/are the insight(s) found from the chart?

Answer

From the chart, we can gain insights about the most popular topics discussed in TED talks. These insights can help identify the areas of interest and the subjects that attract the most attention from the audience.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can potentially create a positive business impact by informing content creators, event organizers, and marketers about the topics that resonate the most with the audience. This knowledge can help in planning future TED events, selecting speakers, and generating engaging content that aligns with the audience's interests.



There might not be any insights that directly lead to negative growth. However, if certain topics have significantly lower talk counts compared to others, it could indicate potential areas for improvement or exploration. By identifying these less popular topics, TED could consider providing more exposure and platforms for discussions on those subjects, which can help diversify the content and attract a wider audience.


#### Chart - 10

# Box Plot: Distribution of Duration by Speaker

In [None]:
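import seaborn as sns

# Keep only the talks given by the 10 most frequent speakers
top_speakers = talk['speaker_1'].value_counts().head(10)
top_talks = talk[talk['speaker_1'].isin(top_speakers.index)]
plt.figure(figsize=(12, 6))
# Pass the filtered frame so that x and y come from the same rows
sns.boxplot(data=top_talks, x='speaker_1', y='duration')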
plt.xlabel('Speaker')
plt.ylabel('Duration')
plt.title('Distribution of Duration by Speaker (Top 10)')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Answer


The specific chart chosen is a boxplot that visualizes the distribution of talk duration for the top 10 speakers.

##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

- The boxplot shows the variation in talk duration for each speaker, represented by the box and whiskers.
- It allows us to compare the median, quartiles, and outliers of duration among the top speakers.
- It provides an understanding of the range and distribution of talk durations for the top speakers.
- It helps identify speakers who have consistently shorter or longer talks compared to others.
- It can reveal potential patterns or trends in talk durations based on the top speakers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

Business Impact:

The insights gained from this chart can be valuable in multiple ways:

- Event organizers can consider the duration preferences of top speakers when planning TED events.
- They can ensure a balanced mix of talk durations to engage the audience effectively.
- Understanding the variation in talk durations can help in optimizing the scheduling and timing of TED events.

Negative Growth:

- There may not be any specific negative growth insights from this chart alone.
- However, if the analysis reveals that significantly longer or shorter talks by certain speakers result in lower engagement or viewer retention, it could impact the overall success of TED events. Proper analysis and consideration of viewer preferences would be needed in such cases.

#### Chart - 11 Count of Talks by Speaker

In [None]:
speaker_counts = talk['speaker_1'].value_counts().head(10)
plt.figure(figsize=(12, 6))
plt.bar(speaker_counts.index, speaker_counts.values)
plt.xlabel('Speaker')
plt.ylabel('Count')
plt.title('Count of Talks by Speaker')
plt.xticks(rotation=45, ha='right')
plt.show()


##### 1. Why did you pick the specific chart?

Answer


The specific chart chosen is a bar chart because it effectively visualizes the count of talks by speaker. The bar chart allows for easy comparison between different speakers.


##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

- The chart shows the top 10 speakers with the highest count of talks.
- The speaker with the highest count is identified by the tallest bar on the chart.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can help create a positive business impact in the following ways:

- Identifying the most prolific speakers can help in inviting them to future TED events, although talk count alone does not show audience appeal and should be read alongside views.
- Understanding the distribution of talks among speakers can help in diversifying the speaker lineup, ensuring a varied and engaging program.

#### Chart - 12 Top 10 Speakers by Views

In [None]:
# Chart - 12 visualization code
top_10_speakers = talk.groupby('speaker_1')['views'].sum().nlargest(10)
plt.figure(figsize=(12, 6))
plt.bar(top_10_speakers.index, top_10_speakers.values)
plt.xlabel('Speaker')
plt.ylabel('Views')
plt.title('Top 10 Speakers by Views')
plt.xticks(rotation=45, ha='right')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a bar chart to visualize the top 10 speakers based on the total views of their TED talks.

##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

- The chart reveals the top 10 speakers who have garnered the highest total views for their TED talks.
- It provides a clear comparison of the views attributed to each speaker, allowing us to identify the most popular speakers in terms of audience engagement.
- The chart helps to highlight the influence and impact of these speakers within the TED community.
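
Total views favor speakers with many talks, so a ranking by sum can differ from a ranking by views per talk. The sketch below compares both aggregations on hypothetical data; the speaker labels and view counts are invented for illustration.

```python
import pandas as pd

# Hypothetical talks: speaker "A" has three modest talks, speaker "B" one big hit
sample = pd.DataFrame({
    "speaker_1": ["A", "A", "A", "B"],
    "views": [400_000, 500_000, 600_000, 1_200_000],
})

# Aggregate views per speaker in three ways at once
per_speaker = sample.groupby("speaker_1")["views"].agg(
    total_views="sum", mean_views="mean", talks="count"
)

print(per_speaker.sort_values("total_views", ascending=False))
```

Here speaker "A" leads on total views while speaker "B" leads on mean views per talk, which is why the two rankings are worth reading together.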

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

Positive business impact:

- The insights gained from this chart can be valuable for event organizers, conference planners, and content creators.
- Identifying the top speakers can guide decisions on speaker invitations, audience targeting, and content strategy.
- It can help attract a larger audience, generate more interest and engagement, and potentially increase revenue through ticket sales, sponsorships, and partnerships.

Negative growth impact:

- There are no specific negative growth insights evident from this chart.
- However, it's important to note that the popularity of speakers may fluctuate over time and across different audiences.
- It is essential to consider a balanced mix of speakers and topics to cater to diverse interests and maintain audience engagement in the long term.

#### Chart - 13 - 3D Bar Plot: Views, Duration, and Talk Index

In [None]:
import numpy as np
# Chart - 13 visualization code
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

x = talk['views']
y = talk['duration']
z = np.arange(len(talk))

ax.bar3d(x, y, z, dx=100000, dy=50, dz=1)

ax.set_xlabel('Views')
ax.set_ylabel('Duration')
ax.set_zlabel('Talk')

# The third axis is the talk's row index, not its topics
plt.title('3D Bar Plot: Views, Duration, and Talk Index')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a 3D bar plot that visualizes the relationship between 'Views', 'Duration', and the 'Talk' index.



##### 2. What is/are the insight(s) found from the chart?

Answer

Insights from the chart:

- The chart provides a visual representation of the distribution of talks based on their views and duration.
- It helps in identifying patterns or clusters of talks based on the combination of views and duration.
- Every bar has the same height (`dz=1`); its position along the vertical axis is the talk's row index, so that axis shows the order of talks in the dataset rather than a measured quantity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer:

The insights gained from this chart can have a positive business impact by providing a visual understanding of how views and duration vary across different talks. It can help in identifying popular talks with high views and shorter duration, which may indicate a higher level of engagement from the audience.

There are no insights from this chart that directly lead to negative growth. However, it is important to note that the chart alone may not provide comprehensive insights into the factors that drive views or audience engagement. Further analysis and exploration of other variables may be necessary to understand the factors influencing the success of a TED talk and its potential business impact.

In [None]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

x = talk['views']
y = talk['duration']
z = talk['num_topics']

ax.scatter(x, y, z, c='b', marker='o')

ax.set_xlabel('Views')
ax.set_ylabel('Duration')
ax.set_zlabel('Number of Topics')

plt.title('3D Scatter Plot: Views, Duration, and Number of Topics')
plt.show()


#### Chart - 14 - Correlation Heatmap

In [None]:
import seaborn as sns

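variables = ['title', 'speaker_1', 'occupations', 'views', 'duration', 'topics', 'num_topics']
# Correlation is only defined for numeric columns, so drop the text columns first
correlation_matrix = talk[variables].select_dtypes(include='number').corr()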

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Variables')
plt.show()


##### 1. Why did you pick the specific chart?

Answer

The specific chart chosen is a correlation heatmap. It is selected to visualize the correlations between different variables in the TED talks dataset.


##### 2. What is/are the insight(s) found from the chart?

Answer

The insights found from the chart are:

- There is a positive correlation between the number of views and the duration of the talk. This suggests that longer talks tend to attract more views.

- There is a weak positive correlation between the number of views and the number of topics covered in the talk. This indicates that talks covering a greater number of topics may slightly contribute to higher views.

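- The heatmap cannot show how views relate to the title, the speaker, or the speaker's occupation: these columns are text, and correlation is computed only for numeric columns. They would first need numeric encodings or derived features such as title length.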
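
Text columns have to be turned into numbers before they can enter a correlation matrix. The sketch below derives a hypothetical `title_length` feature and correlates it with the numeric columns; the rows are invented for illustration, and `title_length` is an assumed feature name rather than a column in the dataset.

```python
import pandas as pd

# Hypothetical rows; titles and numbers are illustrative only
sample = pd.DataFrame({
    "title": ["Short talk", "A somewhat longer talk title", "Medium length title", "Tiny"],
    "views": [120_000, 450_000, 300_000, 80_000],
    "duration": [600, 1100, 900, 300],
})

# Derive a numeric feature from the text column
sample["title_length"] = sample["title"].str.len()

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = sample.select_dtypes(include="number").corr()
print(corr.round(2))
```

The same pattern applies to other text columns, for example counting the number of occupations listed for a speaker.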

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns

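variables = ['views', 'duration', 'num_topics']
# pairplot returns a grid of axes, so set the title on the figure rather than on one subplot
g = sns.pairplot(talk[variables])
g.figure.suptitle('Pair Plot of Views, Duration, and Number of Topics', y=1.02)
plt.show()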


##### 1. Why did you pick the specific chart?

Answer

The pair plot visualization was chosen because it allows us to visualize the relationships between multiple variables simultaneously. In this case, the variables 'views', 'duration', and 'num_topics' are plotted against each other.

##### 2. What is/are the insight(s) found from the chart?

Answer

From the pair plot, we can gain insights into the relationships and patterns between these variables. We can observe the scatter plots between 'views' and 'duration', 'views' and 'num_topics', and 'duration' and 'num_topics'. These plots can help us identify any potential correlations or trends between these variables.

By analyzing the pair plot, we may find insights such as:

- Relationship between 'views' and 'duration': We can observe if there is a positive or negative correlation between the duration of a talk and the number of views it receives. This can help us understand if shorter or longer talks tend to attract more viewers.

- Relationship between 'views' and 'num_topics': We can explore if talks with a higher number of topics covered tend to have more views. This can provide insights into the content preferences of the audience.

- Relationship between 'duration' and 'num_topics': We can examine if there is any relationship between the duration of a talk and the number of topics covered. This can help us understand if longer talks tend to cover a broader range of topics.

Overall, the pair plot allows us to visually explore the relationships between these variables and uncover potential insights that can guide further analysis and decision-making.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
import pandas as pd

# Assuming you have a DataFrame called 'talk' with the provided variables

# Checking for missing values
missing_values = talk.isnull().sum()
print("Missing values:\n", missing_values)

# Handling missing values
# Impute first: once incomplete rows are dropped, there is nothing left to fill
# Example: fill missing values in the numeric 'views' column with the column mean
mean_views = talk['views'].mean()
talk['views'] = talk['views'].fillna(mean_views)

# Then drop any rows that still have missing values in other columns
talk = talk.dropna()

# Checking again for missing values
missing_values_after_imputation = talk.isnull().sum()
print("Missing values after imputation:\n", missing_values_after_imputation)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer

In the given code, two missing value imputation techniques have been used:

1- Dropping rows with missing values:

- Rows with missing values are dropped using the dropna() method.
- This technique is used when the missing values are limited and dropping those rows does not significantly affect the dataset's representativeness.
- It helps to ensure that only complete and valid data points are used for analysis.

2- Imputing missing values with the mean:

- The missing values in the 'views' column are imputed using the mean value of the available data.
- This technique suits numerical columns whose distribution is roughly symmetric; for heavily skewed columns such as view counts, the median is usually a more robust choice.
- Imputing with the mean helps to maintain the overall central tendency of the data and minimizes the impact of missing values on subsequent analysis.
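
The difference between mean and median imputation on a skewed column can be sketched as follows. The values are hypothetical and chosen so that one extreme entry dominates the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately skewed column with one missing value
views = pd.Series([10.0, 12.0, 11.0, 13.0, 1000.0, np.nan])

mean_filled = views.fillna(views.mean())      # the mean is pulled up by the 1000
median_filled = views.fillna(views.median())  # the median stays near the typical value

print("Filled with mean:  ", mean_filled.iloc[-1])
print("Filled with median:", median_filled.iloc[-1])
```

With the mean, the imputed value lies far above every typical entry, whereas the median imputes a value that resembles the bulk of the data.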

### 2. Handling Outliers

In [None]:
import pandas as pd
import numpy as np

# Assuming you have a DataFrame called 'talk' with the provided variables

# Checking for outliers
# Assuming 'views' and 'duration' are the variables to check for outliers
views_outliers = talk['views'].quantile(0.99)
duration_outliers = talk['duration'].quantile(0.99)

# Handling outliers
# Cap values above the 99th percentile at that threshold (upper tail only)
talk['views'] = talk['views'].clip(upper=views_outliers)
talk['duration'] = talk['duration'].clip(upper=duration_outliers)

# Alternatively, you can choose to remove outliers by dropping the rows
# talk = talk[(talk['views'] <= views_outliers) & (talk['duration'] <= duration_outliers)]

# Checking again for outliers: count the rows still above either threshold (should be 0)
outliers_after_treatment = (talk['views'] > views_outliers) | (talk['duration'] > duration_outliers)
print("Outliers after treatment:", outliers_after_treatment.sum())


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer

The code caps (winsorizes) the upper tail of the 'views' and 'duration' columns: values above the 99th percentile are replaced with the 99th-percentile value. The alternative, shown commented out, is to drop those rows. Capping keeps every talk in the dataset, while dropping removes the extreme talks entirely; the choice depends on the context and goals of the analysis.
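
The two options can be compared side by side on a small example. This is a sketch on hypothetical values, using the same 99th-percentile threshold as the code above.

```python
import pandas as pd

# Hypothetical view counts with one extreme value
views = pd.Series([100, 200, 300, 400, 50_000], name="views")
threshold = views.quantile(0.99)

capped = views.clip(upper=threshold)   # keeps all rows, limits the extreme value
dropped = views[views <= threshold]    # removes the extreme row entirely

print("Rows after capping: ", len(capped), "max:", capped.max())
print("Rows after dropping:", len(dropped), "max:", dropped.max())
```

Capping preserves the sample size, which matters for small datasets, while dropping avoids keeping an artificial value in the data.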

### 3. Categorical Encoding

In [None]:
talk = pd.read_csv("/content/data_ted_talks (1).csv")
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

# Assuming you have a DataFrame called 'talk' with categorical columns

# Iterate over each categorical column in the DataFrame
categorical_columns = ['speaker_1', 'event', 'native_lang', 'available_lang']
encoders = {}
for column in categorical_columns:
    # Use a fresh encoder per column and keep it so the codes can be mapped back later
    encoder = LabelEncoder()
    # Cast to str so that missing values do not break the encoder's sorting step
    talk[column] = encoder.fit_transform(talk[column].astype(str))
    encoders[column] = encoder

# Print the encoded DataFrame
print(talk)


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer

The code uses scikit-learn's LabelEncoder to encode the categorical columns of the 'talk' DataFrame, applying fit_transform to each column in turn so that every distinct value is mapped to an integer. This is compact and works well with tree-based models, but the integers imply an ordering that does not exist in the data, so linear models may be misled; one-hot encoding avoids that at the cost of more columns.
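
The contrast between label encoding and one-hot encoding can be sketched on a tiny hypothetical column. `pd.get_dummies` is used here as the one-hot variant; it is not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical language codes
df = pd.DataFrame({"native_lang": ["en", "es", "en", "fr"]})

# Label encoding: one integer per category, assigned in sorted order
label_codes = LabelEncoder().fit_transform(df["native_lang"])

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(df["native_lang"], prefix="native_lang")

print(list(label_codes))
print(one_hot)
```

For high-cardinality columns such as `speaker_1`, one-hot encoding creates one column per speaker, so label encoding or grouping rare values may be the more practical choice.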

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
import pandas as pd

# Read the data from a CSV file
data = pd.read_csv('/content/data_ted_talks (1).csv')

# Perform the desired transformations on the data
# Example transformations:
# 1. Convert the 'recorded_date' and 'published_date' columns to datetime format
data['recorded_date'] = pd.to_datetime(data['recorded_date'])
data['published_date'] = pd.to_datetime(data['published_date'])

# 2. Extract the year from the 'recorded_date' column and create a new column 'recorded_year'
data['recorded_year'] = data['recorded_date'].dt.year

# 3. Apply lowercase to the 'title' column
data['title'] = data['title'].str.lower()

# 4. Remove punctuation from the 'description' column
data['description'] = data['description'].str.replace(r'[^\w\s]', '', regex=True)

# 5. Tokenize the 'transcript' column
data['transcript'] = data['transcript'].apply(lambda x: str(x).split())

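# 6. Filter the data based on certain conditions
# Assumes TEDx events carry a 'TEDx' prefix in 'event' (e.g. 'TEDxExample'); an exact match on 'TEDx' would miss them
filtered_data = data[data['event'].str.startswith('TEDx', na=False)]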

# 7. Group the data by 'event' and compute aggregate statistics
event_stats = data.groupby('event').agg({'views': 'sum', 'duration': 'mean'})

# 8. Sort the data based on the 'published_date' column in descending order
sorted_data = data.sort_values(by='published_date', ascending=False)

# Print the transformed data
print(data.head())
print(filtered_data.head())
print(event_stats)
print(sorted_data.head())

### 6. Data Scaling

In [None]:
# Scaling your data
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Select the numerical columns to be scaled
numerical_columns = ['views', 'duration']

# Perform Min-Max scaling on the selected columns
scaler = MinMaxScaler()
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])

# Print the scaled data
print(data.head())


##### Which method have you used to scale you data and why?

Answer

The code uses Min-Max scaling on the numerical columns of the 'data' DataFrame. Min-Max scaling linearly maps each selected column to the range 0 to 1, so the ordering of the data points and the shape of the distribution are preserved while the absolute magnitudes are not. It is sensitive to outliers, because a single extreme value sets the upper end of the range, and the scaler should be fitted on the training split only so that test-set information does not leak into training.
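
A minimal sketch of scaling without leakage follows: the scaler learns its minimum and maximum from the training split and only applies them to the test split. The durations are randomly generated stand-ins, not values from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical talk durations in seconds
rng = np.random.default_rng(42)
durations = rng.integers(60, 3600, size=(100, 1)).astype(float)

X_train, X_test = train_test_split(durations, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min and max from training data only
X_test_scaled = scaler.transform(X_test)        # reuse them; values may fall slightly outside [0, 1]

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```

Test values outside the 0 to 1 range are expected with this approach and simply reflect that the test set contains durations the scaler did not see.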

### 8. Data Splitting

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the dataset
talk = pd.read_csv("/content/data_ted_talks (1).csv")

# Select the features
features = talk[['talk_id', 'title', 'speaker_1', 'all_speakers', 'occupations', 'about_speakers', 'views', 'recorded_date', 'published_date', 'event', 'native_lang', 'available_lang', 'comments', 'duration', 'topics', 'related_talks', 'url', 'description', 'transcript']]

# Split the data into training and testing sets
test_size = 0.2  # Specify the ratio for the test set
random_state = 42  # Set a random seed for reproducibility
X_train, X_test = train_test_split(features, test_size=test_size, random_state=random_state)

# Print the shapes of the resulting datasets
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)


##### What data splitting ratio have you used and why?

Answer

The code uses a test_size ratio of 0.2, which means 20% of the data will be allocated for the testing set, while the remaining 80% will be used for the training set. The random_state is set to 42 to ensure reproducibility. The specific ratio chosen can vary depending on the size of the dataset and the desired balance between the training and testing sets. A 0.2 ratio is commonly used as it provides a good balance between having enough data for training and having a sufficient amount for testing and evaluation.
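
For a supervised model the target has to be separated from the features before splitting. The sketch below assumes `views` is the target, in line with the project's stated goal, and uses randomly generated stand-in data with hypothetical column values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical numeric sample; real data would come from the TED talks CSV
rng = np.random.default_rng(42)
sample = pd.DataFrame({
    "duration": rng.integers(60, 3600, size=50),
    "comments": rng.integers(0, 500, size=50),
    "views": rng.integers(1_000, 1_000_000, size=50),
})

# Separate the target from the features so it cannot leak into the inputs
X = sample.drop(columns=["views"])
y = sample["views"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Passing `X` and `y` together keeps the rows of features and target aligned in both splits.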

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Read the dataset
talk = pd.read_csv("/content/data_ted_talks (1).csv")

# Select the features and target variable
features = talk[['views', 'duration']].copy()  # Replace with relevant feature columns
target = talk['comments'].copy()  # Replace with the target variable column

# Handle missing values
features.loc[:, :] = features.fillna(features.mean())
target.loc[:] = target.fillna(target.mean())

# Split the data into training and testing sets
test_size = 0.2  # Specify the ratio for the test set
random_state = 42  # Set a random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=test_size, random_state=random_state)

# Create an instance of the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

# Calculate the mean squared error (mse)
mse = mean_squared_error(y_test, predictions)

print(mse)
# Visualize the mse score
plt.figure(figsize=(8, 5))
plt.bar(['MSE'], [mse])
plt.xlabel('Metric')
plt.ylabel('Score')
plt.title('Evaluation Metric: Mean Squared Error (MSE)')
plt.show()

# Print the model performance
print("ML Model: Linear Regression")
print("Mean Squared Error:", mse)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Read the dataset
talk = pd.read_csv("/content/data_ted_talks (1).csv")

# Select the features and target variable
features = talk[['views', 'duration']].copy()  # Replace with relevant feature columns
target = talk['comments'].copy()  # Replace with the target variable column

# Handle missing values
features.loc[:, :] = features.fillna(features.mean())
target.loc[:] = target.fillna(target.mean())

# Split the data into training and testing sets
test_size = 0.2  # Specify the ratio for the test set
random_state = 42  # Set a random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=test_size, random_state=random_state)

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'fit_intercept': [True, False]
}

# Create an instance of the model
model = LinearRegression()

# Perform grid search cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Make predictions on the test set
predictions = best_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
print("ML Model: Linear Regression")
print("Best Parameters:", best_params)
print("Mean Squared Error:", mse)


##### Which hyperparameter optimization technique have you used and why?

Answer

The code uses GridSearchCV for hyperparameter optimization. GridSearchCV exhaustively searches the specified parameter grid for the combination of hyperparameters that maximizes the scoring metric (here negative mean squared error, so maximizing it minimizes the MSE). Each combination is evaluated with 5-fold cross-validation, and the best model is then refit on the full training set. GridSearchCV is commonly used because it explores the grid systematically; for linear regression the grid contains only `fit_intercept`, so the search is small.
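
The cross-validated score of every grid point can be read from `cv_results_`, which makes the effect of each setting visible. The sketch below uses synthetic data with a deliberately large intercept, so the data and the resulting scores are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data: y depends linearly on x with an intercept of 50
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 50.0 + rng.normal(0, 1, size=200)

grid = GridSearchCV(
    LinearRegression(),
    {"fit_intercept": [True, False]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# The scores are negated MSE values, so flip the sign for reporting
for params, score in zip(grid.cv_results_["params"], grid.cv_results_["mean_test_score"]):
    print(params, "CV MSE:", -score)
print("Best:", grid.best_params_, "CV MSE:", -grid.best_score_)
```

Printing the per-setting scores this way shows how much each option actually changes the error, which a single best-parameter line does not.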

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer

The grid search only toggles `fit_intercept`, so little change in MSE is expected for linear regression. The tuned and untuned MSE values are printed by the cells above, but no updated evaluation metric score chart was produced, so the improvement is not quantified here.

### ML Model - 2

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Read the dataset
talks = pd.read_csv("/content/data_ted_talks (1).csv")

# Select the features and target variable
features = talks[['views', 'duration']].copy()  # Replace with relevant feature columns
target = talks['comments'].copy()  # Replace with the target variable column

# Handle missing values
features.loc[:, :] = features.fillna(features.mean())
target.loc[:] = target.fillna(target.mean())

# Split the data into training and testing sets
test_size = 0.2  # Specify the ratio for the test set
random_state = 42  # Set a random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=test_size, random_state=random_state)

# Create an instance of the model (fixed seed so the forest is reproducible)
model = RandomForestRegressor(random_state=random_state)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
print("ML Model - 2")
print("Mean Squared Error:", mse)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

import matplotlib.pyplot as plt

# Define the evaluation metric scores
metric_scores = [mse]  # Replace with your actual metric score
labels = ['Mean Squared Error']  # Replace with your actual metric label

# Create a bar plot for the evaluation metric scores
plt.figure(figsize=(10, 6))
plt.bar(labels, metric_scores)
plt.xlabel('Metrics')
plt.ylabel('Score')
plt.title('Evaluation Metric Score Chart - ML Model 2')
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Read the dataset
talks = pd.read_csv("/content/data_ted_talks (1).csv")

# Select the features and target variable
features = talks[['views', 'duration']].copy()  # Replace with relevant feature columns
target = talks['comments'].copy()  # Replace with the target variable column

# Handle missing values
features.loc[:, :] = features.fillna(features.mean())
target.loc[:] = target.fillna(target.mean())

# Split the data into training and testing sets
test_size = 0.2  # Specify the ratio for the test set
random_state = 42  # Set a random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=test_size, random_state=random_state)

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create an instance of the model (fixed seed so the forest is reproducible)
model = RandomForestRegressor(random_state=random_state)

# Perform grid search cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Make predictions on the test set
predictions = best_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
print("ML Model - 2")
print("Best Parameters:", best_params)
print("Mean Squared Error:", mse)


##### Which hyperparameter optimization technique have you used and why?

Answer

The code uses GridSearchCV to tune the Random Forest Regressor. It evaluates every combination in the grid (three values each for n_estimators, max_depth, min_samples_split, and min_samples_leaf, i.e. 81 combinations) with 5-fold cross-validation and keeps the one with the lowest cross-validated mean squared error.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer

The Random Forest Regressor was tuned with GridSearchCV. Whether tuning helped can only be judged by comparing the test-set mean squared error (MSE) before and after optimization; this notebook prints both values but does not tabulate or chart them side by side, so no improvement is claimed here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer

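The Random Forest Regressor is evaluated using mean squared error (MSE). Lower MSE indicates better model performance, i.e. predicted comment counts that lie closer to the actual counts. Because the errors are squared, MSE is expressed in squared comments and penalizes large misses heavily. Comment counts serve here as a measure of audience engagement that can inform data-driven decision-making.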

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer

The evaluation metric used in this code is mean squared error (MSE). MSE is commonly used for regression tasks as it measures the average squared difference between the predicted and actual values. By minimizing the MSE, the model aims to improve the accuracy of predictions and reduce the impact of large errors, which can be important for understanding the performance and business impact of the model.
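
MSE is often reported together with metrics that are easier to interpret. The sketch below computes RMSE, MAE, and R² alongside MSE on a few hypothetical predictions; the numbers are invented and are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MSE={mse:.1f}  RMSE={rmse:.2f}  MAE={mae:.1f}  R2={r2:.2f}")
```

RMSE and MAE are in the same units as the target, which makes them easier to communicate to stakeholders than a squared quantity.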

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer

The final prediction model is the Random Forest Regressor (ML Model - 2). It was selected because its hyperparameters were tuned with GridSearchCV, and the tuned model achieved improved performance, as indicated by a lower mean squared error.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer

The model is a Random Forest Regressor, an ensemble of decision trees whose predictions are averaged, with hyperparameters tuned using GridSearchCV. It is trained on the features 'views' and 'duration' to predict the target variable 'comments'. Feature importance can be read from the `feature_importances_` attribute of the trained model, which gives the relative contribution of each feature to the trees' splits. A more detailed explanation requires additional code using a model explainability tool, such as permutation importance or SHAP.
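
A minimal sketch of both approaches follows, using the built-in `feature_importances_` attribute and scikit-learn's `permutation_importance`. The feature names match those mentioned above, but the data is synthetic and constructed so that the first feature dominates; it does not reflect the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = ["views", "duration"]

# Synthetic stand-in data (assumption): the target depends mostly on column 0.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 2))
y = 5.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances: computed from the training data, sum to 1.
impurity_importance = model.feature_importances_

# Permutation importance: drop in test score when one feature is shuffled.
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

for i, name in enumerate(feature_names):
    print(
        f"{name:<10} impurity={impurity_importance[i]:.3f} "
        f"permutation={perm.importances_mean[i]:.3f}"
    )
```

Permutation importance is measured on held-out data, so it is often a useful cross-check on the impurity-based values, which are derived from the training set only.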

# **Conclusion**

The project aimed to build a predictive model to forecast the views of videos uploaded on the TEDx website. The following steps were undertaken to achieve this objective:

1. Efficient Exploratory Data Analysis (EDA): The dataset containing over 4,000 TED talks, including transcripts in multiple languages, was thoroughly analyzed to gain insights into the features and their relationships.

2. Encoding and Feature Selection: If necessary, categorical variables were encoded to numerical form for modeling purposes. Feature selection techniques were applied to identify the most relevant features for the prediction task.

3. Dealing with Multicollinearity: In case of any multicollinearity among the features, appropriate techniques such as correlation analysis or dimensionality reduction methods were employed to address it.

4. Feature Scaling: If required, features were scaled or normalized to ensure they have a consistent scale for modeling algorithms that are sensitive to feature magnitudes.

5. Understanding the Target Feature: The target feature, i.e., the number of views, was studied in detail to understand its distribution and identify any patterns or outliers.

6. Modeling: Two or more machine learning algorithms were implemented to build predictive models. In this case, Linear Regression and Random Forest Regression models were used.

7. Evaluation and Improvement: The models' performance was evaluated using metrics such as Mean Squared Error (MSE). Techniques like cross-validation and hyperparameter tuning were applied to improve the models' accuracy and generalization capabilities.

8. Feature Importance and Conclusion: The importance of different features in predicting the views of TED talks was determined, providing insights into the factors that contribute significantly to video popularity. This information can help stakeholders understand the key drivers of video views and optimize their content creation strategies.

9. Business Stakeholders' Utility: The project's outcome is useful to stakeholders, including TEDx organizers, content creators, and marketing teams. It enables them to forecast the potential views of their videos and make informed decisions regarding content selection, promotion strategies, and resource allocation.
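
The modeling and evaluation steps (6 and 7) can be sketched as a cross-validated comparison of the two model families named above. The data, the number of folds, and the forest size are assumptions for illustration, and the printed scores are not results from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption) with one linear and one non-linear effect.
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(200, 3))
y = 4.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0, 0.1, size=200)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=7),
}

cv_mse = {}
for name, model in models.items():
    # 5-fold cross-validation; scores come back as negative MSE.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores  # flip the sign back to a positive MSE per fold
    print(
        f"{name}: mean MSE = {cv_mse[name].mean():.4f} "
        f"(std {cv_mse[name].std():.4f})"
    )
```

Comparing the models on the same folds gives a fairer picture than a single train/test split, since each model is scored on every part of the data.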

In conclusion, the project successfully developed predictive models to estimate the views of TEDx videos and provided valuable insights for stakeholders to optimize their video content and engagement strategies.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***