<a href="https://colab.research.google.com/github/riyajaiswal25/SVM/blob/main/Textclassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('BBC News Train.csv')

In [3]:
df

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business
2,1101,bbc poll indicates economic gloom citizens in ...,business
3,1976,lifestyle governs mobile choice faster bett...,tech
4,917,enron bosses in $168m payout eighteen former e...,business
...,...,...,...
1485,857,double eviction from big brother model caprice...,entertainment
1486,325,dj double act revamp chart show dj duo jk and ...,entertainment
1487,1590,weak dollar hits reuters revenues at media gro...,business
1488,1587,apple ipod family expands market apple has exp...,tech


In [4]:
df.head()

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business
2,1101,bbc poll indicates economic gloom citizens in ...,business
3,1976,lifestyle governs mobile choice faster bett...,tech
4,917,enron bosses in $168m payout eighteen former e...,business


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  1490 non-null   int64 
 1   Text       1490 non-null   object
 2   Category   1490 non-null   object
dtypes: int64(1), object(2)
memory usage: 35.0+ KB


In [6]:
df.describe()

Unnamed: 0,ArticleId
count,1490.0
mean,1119.696644
std,641.826283
min,2.0
25%,565.25
50%,1112.5
75%,1680.75
max,2224.0


In [7]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [8]:
# Separate features (X) and target variable (y)
X = df['Text']
y = df['Category']

In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
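
The split above is random but not stratified, so the class proportions in the test set can drift from those in the full dataset. As an optional variant (not the one used for the results reported in this notebook), `train_test_split` accepts a `stratify` argument. The sketch below uses made-up toy data; `toy_texts` and `toy_labels` are hypothetical names.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 business and 5 sport articles.
toy_texts = [f"article {i}" for i in range(15)]
toy_labels = ["business"] * 10 + ["sport"] * 5

# stratify=toy_labels keeps the 2:1 class ratio in both the train and test split.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print("train:", Counter(labels_train))
print("test: ", Counter(labels_test))
```

In this notebook the equivalent change would be passing `stratify=y`. That produces a different split, so the metrics reported further down would not carry over unchanged.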

In [10]:
# Convert text data to numerical features using a TF-IDF vectorizer.
# max_features=1000 keeps only the 1,000 most frequent terms in the training corpus.
# The vectorizer is fitted on the training split only and then applied to the test split.
vectorizer = TfidfVectorizer(max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
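
To make the TF-IDF step more concrete, here is a minimal pure-Python sketch of the weighting, assuming scikit-learn's documented defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The helper `tfidf_rows` and the toy corpus are hypothetical, and the whitespace tokenizer is a simplification. The real `TfidfVectorizer` also handles tokenization rules, `max_features` selection, and sparse output.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) of smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]  # simplified tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard against empty docs
        rows.append([w / norm for w in raw])
    return vocab, rows


# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "shares fall as markets slide",
    "striker scores as team wins",
    "markets rally as shares rise",
]
vocab, rows = tfidf_rows(toy_docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term:>8}: {weight:.3f}")
```

The word "as" occurs in every toy document, so it receives the lowest weight in each row, while terms unique to one document receive the highest.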

In [11]:
# Create SVM model
svm_model = SVC(kernel='linear')

In [12]:
# Train the model
svm_model.fit(X_train_tfidf, y_train)

In [13]:
# Make predictions on the test set
y_pred = svm_model.predict(X_test_tfidf)

In [14]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
report = classification_report(y_test, y_pred)

In [15]:
print("Accuracy:", accuracy)
print("F1-Score:", f1)
print("Classification Report:\n", report)

Accuracy: 0.9798657718120806
F1-Score: 0.9799625431329347
Classification Report:
                precision    recall  f1-score   support

     business       0.96      0.97      0.97        75
entertainment       1.00      0.98      0.99        46
     politics       0.95      0.96      0.96        56
        sport       1.00      1.00      1.00        63
         tech       1.00      0.98      0.99        58

     accuracy                           0.98       298
    macro avg       0.98      0.98      0.98       298
 weighted avg       0.98      0.98      0.98       298



# Model Performance Metrics:
**Accuracy**: 97.99% (292 of 298 test articles classified correctly)

**F1-Score (weighted)**: 98.00%

# Conclusion:
**Overall Performance**: On the 298 held-out test articles, the model reaches about 98% accuracy and weighted F1-score, indicating that it classifies news articles into the five categories reliably on this split.

# Class-specific Performance:

**Sport Category**: Precision, recall, and F1-score are all 1.00. Every one of the 63 sport articles in the test set was identified, and no other article was labeled as sport.

**Entertainment and Tech Categories**: Both reach precision 1.00, recall 0.98, and F1-score 0.99. No article was wrongly assigned to either category, and a recall of 0.98 corresponds to roughly one missed article in each.

**Business and Politics Categories**: Scores are slightly lower (business: precision 0.96, recall 0.97; politics: precision 0.95, recall 0.96) but still strong. These two categories account for most of the model's errors on this split.
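
The classification report does not show which categories are mistaken for one another. A confusion matrix does. The sketch below uses `sklearn.metrics.confusion_matrix` on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical); in this notebook the inputs would be `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

labels = ["business", "politics", "sport"]

# Hypothetical labels for illustration only, not results from this notebook.
y_true_toy = ["business", "business", "politics", "politics", "sport", "sport"]
y_pred_toy = ["business", "politics", "politics", "business", "sport", "sport"]

# Rows are true classes, columns are predicted classes, both in the order of `labels`.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

for name, row in zip(labels, cm):
    print(f"{name:>10}", row.tolist())
```

Off-diagonal entries count the misclassifications, so a nonzero cell in the business row and politics column means a business article was predicted as politics.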

**Weighted and Macro Averages**: The macro average gives every class equal weight, while the weighted average weights each class by its support and so accounts for the mild class imbalance (46 to 75 test articles per class). Both are 0.98 for precision, recall, and F1-score, indicating consistent classification across categories.
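
As a quick check of how the two averages differ, the sketch below recomputes them from the per-class F1-scores and supports in the classification report above. The inputs are the report's two-decimal values, so the results only match the report at that precision; the dictionary name `per_class` is just for illustration.

```python
# Per-class (F1-score, support) copied from the classification report, rounded to two decimals.
per_class = {
    "business": (0.97, 75),
    "entertainment": (0.99, 46),
    "politics": (0.96, 56),
    "sport": (1.00, 63),
    "tech": (0.99, 58),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: plain mean over classes, ignoring class size.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(f"support = {total_support}")
print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

With classes this close in size and score, the two averages agree to two decimals. They would diverge if a small class scored much worse than the large ones.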


The linear SVM trained on TF-IDF features is effective at categorizing these news articles by topic, with its strongest results in the sport, entertainment, and tech categories. The high accuracy and F1-score suggest that the model could be valuable for automated topic classification of news articles. These figures come from a single 80/20 split, so consider further hyperparameter tuning or experimentation if needed, and ensure that the data used for testing is representative of real-world scenarios.
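
One possible way to approach the hyperparameter tuning mentioned above is a cross-validated grid search over a pipeline. This is a sketch under assumptions, not a procedure that was run for this notebook: the grid values, `cv=2`, and the toy corpus (`toy_texts`, `toy_labels`) are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Putting the vectorizer in the pipeline means it is refitted on each training fold,
# so no vocabulary or IDF statistics leak from the validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear")),
])

# Illustrative grid; sensible ranges depend on the real data.
param_grid = {
    "tfidf__max_features": [500, 1000],
    "svm__C": [0.1, 1, 10],
}

# Hypothetical toy corpus so the sketch runs on its own.
toy_texts = [
    "shares fall as markets slide", "bank profits rise on strong sales",
    "oil prices lift company earnings", "investors fear weak dollar",
    "striker scores in cup final", "team wins league title",
    "coach praises young players", "champion retains tennis crown",
]
toy_labels = ["business"] * 4 + ["sport"] * 4

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_weighted")
search.fit(toy_texts, toy_labels)

print("best parameters:", search.best_params_)
print("best CV score:", search.best_score_)
```

In the notebook, `search.fit(X_train, y_train)` on the raw training text would take the place of the toy data, and a larger `cv` (for example 5) would be more typical with 1,192 training articles.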