
In this project, we use Dask to load and preprocess a large dataset of online news articles, then use Scikit-Learn to train a machine learning model on the article content. Note that the code below uses the `shares` column as its target and fits a linear regression, so it predicts a numeric value rather than a category label.

In [1]:
!pip install "dask[dataframe]" dask-ml scikit-learn pandas nltk


Load the dataset

In [6]:
import dask.dataframe as dd

# Download dataset
!wget https://storage.googleapis.com/dask-tutorial-data/notebook-data/online-news-popularity/online-news-popularity.csv

# Load dataset with Dask
df = dd.read_csv('online-news-popularity.csv')

Preprocess the dataset

In [None]:
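import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Download the NLTK resources used below (newer NLTK releases may also need 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

# Tokenize the article content and remove stop words
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [token.lower() for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

# meta tells Dask the name and dtype of the result so it does not have to infer them
df['tokens'] = df['content'].apply(tokenize, meta=('tokens', 'object'))

# Create bag-of-words features (Scikit-Learn needs in-memory data, hence compute())
cv = CountVectorizer(max_features=10000)
X = cv.fit_transform(df['tokens'].compute().apply(' '.join))
# Materialize the target as a pandas Series so it can be split alongside X
y = df['shares'].compute()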

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
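
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what a count-based vectorizer with a `max_features` cap roughly does: count tokens across the corpus, keep the most frequent ones, and emit one row of counts per document. The function name `bag_of_words` and the toy documents are hypothetical, and the tie-breaking rule is an assumption; `CountVectorizer` itself also applies its own tokenization and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(docs, max_features=None):
    """Return (vocabulary, count rows) for a list of token lists (illustrative helper)."""
    # Total frequency of every token across the whole corpus
    totals = Counter(token for doc in docs for token in doc)
    # Rank by descending frequency; ties are broken alphabetically (an assumption)
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    # Keep the top max_features tokens and store the vocabulary in sorted order
    vocab = sorted(ranked[:max_features])
    index = {token: i for i, token in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc:
            if token in index:  # tokens outside the capped vocabulary are dropped
                row[index[token]] += 1
        rows.append(row)
    return vocab, rows

# Toy, already-tokenized documents (made up for illustration)
docs = [["market", "shares", "rise"], ["market", "falls"], ["shares", "rise", "rise"]]
vocab, counts = bag_of_words(docs, max_features=3)
print(vocab)
print(counts)
```

With the cap set to 3, the least frequent token (`falls`) is left out of the vocabulary, which mirrors how `max_features=10000` limits the feature space in the cell above.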

Define a Scikit-Learn model

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

Train the model using Dask for parallel processing

In [None]:
from dask_ml.model_selection import GridSearchCV

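# 'normalize' was removed from LinearRegression in scikit-learn 1.2; search over 'fit_intercept' instead
parameters = {'fit_intercept': [True, False]}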
grid_search = GridSearchCV(model, parameters, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
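
The `cv=3` argument asks the grid search to score each parameter setting with three-fold cross-validation. The sketch below shows how such folds can be built, assuming contiguous, unshuffled folds (which is my understanding of the default for a regressor, not something the notebook states). The helper `kfold_indices` is hypothetical and written in plain Python for illustration only.

```python
def kfold_indices(n_samples, n_splits=3):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        # Everything outside the validation slice is used for training
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(kfold_indices(7, n_splits=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each candidate in `parameters` is fitted once per fold, so two candidates with three folds mean six fits, and these independent fits are what can be run in parallel.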

Evaluate the model on the test set

In [None]:
from sklearn.metrics import r2_score
y_pred = grid_search.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2:.2f}')
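
For readers unfamiliar with the metric, R-squared compares the model's squared error with that of a baseline that always predicts the mean of the true values: R² = 1 − SS_res / SS_tot. The following is a minimal plain-Python sketch of that formula; the function name `r_squared` and the numbers are made up for illustration, and it does not handle the edge case of a constant target.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    # (zero when y_true is constant, which this sketch does not guard against)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [100.0, 200.0, 300.0, 400.0]
perfect = r_squared(y_true, y_true)                          # predictions match exactly
baseline = r_squared(y_true, [250.0] * 4)                    # always predict the mean
worse = r_squared(y_true, [400.0, 300.0, 200.0, 100.0])      # predictions reversed
print(perfect, baseline, worse)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean the model does worse than that baseline.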