# Let's Import Some Libraries:

In [1]:
import os

%matplotlib inline
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    ConfusionMatrixDisplay,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.naive_bayes import MultinomialNB
from collections import Counter
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC

# Load the Data:

In [2]:
df = pd.read_csv("Articles.csv",encoding='ISO-8859-1')

In [3]:
df

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business
...,...,...,...,...
2687,strong>DUBAI: Dubai International Airport and ...,3/25/2017,Laptop ban hits Dubai for 11m weekend traveller,business
2688,"strong>BEIJING: Former Prime Minister, Shaukat...",3/26/2017,Pak China relations not against any third coun...,business
2689,strong>WASHINGTON: Uber has grounded its fleet...,3/26/2017,Uber grounds self driving cars after accid,business
2690,strong>BEIJING: The New Development Bank plans...,3/27/2017,New Development Bank plans joint investments i...,business


# Let's Understand the Data:

**Some questions to ask myself:**

## 1) How big is the data?


In [4]:
print("Number of Rows :", df.shape[0])
print("Number of Columns :",df.shape[1])

Number of Rows : 2692
Number of Columns : 4


## 2) What does the data look like?

In [5]:
df.head()

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business


In [6]:
df.sample(5)

Unnamed: 0,Article,Date,Heading,NewsType
2548,strong>ISLAMABAD: Pakistan and Russia have hel...,12/14/2016,Pakistan Russia hold consultations on economic...,business
1764,DUBAI: England will aim to narrow the gap wit...,5/18/2016,Sri Lanka eyes sixth spot with series win agai...,sports
1664,strong>ISLAMABAD: Veteran players of Pakistan ...,4/27/2016,Pak India hockey veterans set for action in May,sports
46,ISLAMABAD: Federal Finance Minister Ishaq Dar ...,2/12/2015,cnic number now tax number only companies allo...,business
2043,strong>LONDON: England's one-day captain Eoin ...,6/21/2016,Morgan Englands winning momentum Sri L,sports


## 3) What are the data types of the columns?

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2692 entries, 0 to 2691
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Article   2692 non-null   object
 1   Date      2692 non-null   object
 2   Heading   2692 non-null   object
 3   NewsType  2692 non-null   object
dtypes: object(4)
memory usage: 84.2+ KB


## 4) Are there any missing values?

In [8]:
df.isnull().sum()

Article     0
Date        0
Heading     0
NewsType    0
dtype: int64

* Every value in this dataset is non-null, so the data can go straight on to preprocessing.

## 5) Are there duplicate rows in the dataset?

In [9]:
df.duplicated().sum()
# If the data contains duplicate rows, they should be removed from the dataset first.

107

In [10]:
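# Note: drop_duplicates() returns a new DataFrame; df itself stays unchanged
# unless the result is assigned back (e.g. df = df.drop_duplicates()).
df.drop_duplicates()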

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business
...,...,...,...,...
2669,strong>DUBAI: Dubai International Airport and ...,3/25/2017,Laptop ban hits Dubai for 11m weekend traveller,business
2670,"strong>BEIJING: Former Prime Minister, Shaukat...",3/26/2017,Pak China relations not against any third coun...,business
2671,strong>WASHINGTON: Uber has grounded its fleet...,3/26/2017,Uber grounds self driving cars after accid,business
2690,strong>BEIJING: The New Development Bank plans...,3/27/2017,New Development Bank plans joint investments i...,business
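
Because `drop_duplicates()` returns a new DataFrame rather than modifying `df`, the duplicates only disappear if the result is assigned. A minimal sketch on a toy frame follows; the `toy` and `deduped` names and the sample values are illustrative assumptions, not part of this notebook:

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative data only).
toy = pd.DataFrame(
    {
        "Heading": ["a", "b", "a"],
        "NewsType": ["sports", "business", "sports"],
    }
)

# Without assignment the original frame is left untouched.
toy.drop_duplicates()
rows_without_assignment = len(toy)

# Assigning the result keeps the de-duplicated rows;
# reset_index(drop=True) renumbers them from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(rows_without_assignment, len(deduped))
```

In this notebook the result of cell 10 is not assigned, which is consistent with the feature matrix later still having 2692 rows.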


## 6) What is the correlation between columns?

In [11]:

# All four columns are text (object dtype), so there are no numeric columns to correlate.
df.corr()

In [12]:
# Print the columns in the dataset
print(df.columns)

Index(['Article', 'Date', 'Heading', 'NewsType'], dtype='object')


In [13]:
print(df['NewsType'].value_counts())

sports      1408
business    1284
Name: NewsType, dtype: int64


**The dataset contains 1408 sports articles and 1284 business articles.**
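
These counts also give a reference point for any accuracy figure: a model that always predicts the larger class would already be right a little over half the time. A small worked calculation, using only the two counts printed above (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output above.
counts = {"sports": 1408, "business": 1284}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy * 100, 1))  # sports 52.3
```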

In [14]:
# removing the Date column since it is not a useful feature for classifying articles
df=df.drop(columns=['Date'])

In [15]:
df

Unnamed: 0,Article,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,us oil prices slip below 50 a barr,business
...,...,...,...
2687,strong>DUBAI: Dubai International Airport and ...,Laptop ban hits Dubai for 11m weekend traveller,business
2688,"strong>BEIJING: Former Prime Minister, Shaukat...",Pak China relations not against any third coun...,business
2689,strong>WASHINGTON: Uber has grounded its fleet...,Uber grounds self driving cars after accid,business
2690,strong>BEIJING: The New Development Bank plans...,New Development Bank plans joint investments i...,business


In [16]:
# Combine each row's heading and article text, split it on spaces,
# and store the resulting token list in the 'words' list.
words=list()
for i,row in df.iterrows():
    temp=row['Heading']+' '+row['Article']
    words.append(temp.split(' '))

In [17]:
print('\'words\' variable contains for example: ')
print(words[0])

'words' variable contains for example: 
['sindh', 'govt', 'decides', 'to', 'cut', 'public', 'transport', 'fares', 'by', '7pc', 'kti', 'rej', 'KARACHI:', 'The', 'Sindh', 'government', 'has', 'decided', 'to', 'bring', 'down', 'public', 'transport', 'fares', 'by', '7', 'per', 'cent', 'due', 'to', 'massive', 'reduction', 'in', 'petroleum', 'product', 'prices', 'by', 'the', 'federal', 'government,', 'Geo', 'News', 'reported.Sources', 'said', 'reduction', 'in', 'fares', 'will', 'be', 'applicable', 'on', 'public', 'transport,', 'rickshaw,', 'taxi', 'and', 'other', 'means', 'of', 'traveling.Meanwhile,', 'Karachi', 'Transport', 'Ittehad', '(KTI)', 'has', 'refused', 'to', 'abide', 'by', 'the', 'government', 'decision.KTI', 'President', 'Irshad', 'Bukhari', 'said', 'the', 'commuters', 'are', 'charged', 'the', 'lowest', 'fares', 'in', 'Karachi', 'as', 'compare', 'to', 'other', 'parts', 'of', 'the', 'country,', 'adding', 'that', '80pc', 'vehicles', 'run', 'on', 'Compressed', 'Natural', 'Gas', '(CNG

In [18]:
# doing some necessary cleaning in the 'words' list
for i in range(len(words)):
    for j in range(len(words[i])):
        words[i][j]=words[i][j].replace(':','')
        if not words[i][j].isalpha():
            words[i][j]=''
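
To see what this cleaning step keeps and discards, here is a minimal sketch of the same rule applied to a single string. The helper name `clean_tokens` and the sample sentence are assumptions made for illustration:

```python
def clean_tokens(text):
    """Split on single spaces, drop colons, and blank out non-alphabetic tokens."""
    cleaned = []
    for token in text.split(' '):
        token = token.replace(':', '')
        # Tokens with digits, brackets, or embedded punctuation become ''.
        cleaned.append(token if token.isalpha() else '')
    return cleaned


sample = "KARACHI: fares cut by 7pc (KTI) reported.Sources said"
tokens = clean_tokens(sample)
print(tokens)
# ['KARACHI', 'fares', 'cut', 'by', '', '', '', 'said']
```

Note that a token such as `reported.Sources`, where two sentences run together without a space, is discarded entirely, so both words are lost.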

In [19]:
# counting words and storing them in a dictionary format: 'word': occurrence count
words_dict=Counter()
for i in range(len(words)):
    words_dict+=Counter(words[i])
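
A possible alternative to adding a fresh `Counter` on every iteration is `Counter.update`, which accumulates counts in place. The sketch below uses made-up token lists; it also shows that `most_common` returns a list of `(word, count)` tuples, which is why later cells index each entry with `[0]`:

```python
from collections import Counter

# Illustrative token lists standing in for the cleaned 'words' list.
token_lists = [
    ["oil", "prices", "oil", ""],
    ["match", "oil", "prices", ""],
]

counts = Counter()
for tokens in token_lists:
    counts.update(tokens)  # in-place accumulation

# Drop the blank placeholder left behind by cleaning.
counts.pop("", None)

top_two = counts.most_common(2)
print(top_two)  # [('oil', 3), ('prices', 2)]
```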

In [20]:
type(Counter(words[0]))

collections.Counter

In [21]:
# deleting the dictionary key '' (blank) left behind by the cleaning step
del words_dict['']


In [22]:
# words_dict contains 25494 key-value pairs
len(words_dict)

25494

In [23]:
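# Taking the 3500 most common words out of 25494
# we will use these 3500 words to train our model
# Note: most_common() returns a list of (word, count) tuples,
# so words_dict is a list from here on, not a Counter
words_dict=words_dict.most_common(3500)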


In [24]:
# feature engineering: for each article, count how often each of the
# 3500 vocabulary words occurs
features=[]
for i in range(len(words)):
    t=words[i]
    data=[]
    for word,_ in words_dict:
        data.append(t.count(word))
    features.append(data)
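
The loop above calls `list.count` once per vocabulary word, so each article is rescanned 3500 times. One way to get the same counts with a single pass per article is to build a `Counter` first and look words up in it. This is a sketch on toy data; `vocab` and `docs` are illustrative stand-ins for the vocabulary list and the `words` list:

```python
from collections import Counter

vocab = ["oil", "match", "prices"]  # stand-in for the 3500 most common words
docs = [
    ["oil", "prices", "oil"],
    ["match", "", "match", "oil"],
]

features = []
for tokens in docs:
    counts = Counter(tokens)  # one pass over the document
    # A Counter returns 0 for words that do not occur in the document.
    features.append([counts[word] for word in vocab])

print(features)  # [[2, 0, 1], [1, 2, 0]]
```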

In [25]:
# Independent variables (feature matrix) 'x'
x=np.array(features)

In [29]:
x[0]

array([5, 5, 4, ..., 0, 0, 0])

In [30]:
x.shape

(2692, 3500)

In [31]:
# Since we need to predict the class of the article, NewsType will be our target variable
df['NewsType']=df['NewsType'].replace({'sports':0,'business':1})

target=df['NewsType'].iloc[:].values

In [32]:
target.shape

(2692,)

In [33]:
# Target Variable
y=np.array(target)

In [34]:
classifier=MultinomialNB()

In [35]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=7)

In [36]:
classifier.fit(x_train,y_train)

In [37]:
y_pred=classifier.predict(x_test)

In [39]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)*100

99.25788497217069
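
The reported accuracy can be translated back into a number of misclassified test articles. The calculation below assumes the test-set size is the sample count times `test_size`, rounded up; the variable names are illustrative:

```python
import math

n_samples = 2692  # rows in the feature matrix
test_size = 0.2
accuracy_percent = 99.25788497217069

# Assumption: test-set size = ceil(n_samples * test_size).
n_test = math.ceil(n_samples * test_size)
n_correct = round(accuracy_percent / 100 * n_test)
n_errors = n_test - n_correct

print(n_test, n_correct, n_errors)  # 539 535 4
```

Under that assumption, the score corresponds to 4 errors out of 539 test articles.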

In [51]:
def checker(heading,body):
    """Pass the heading and body of an article to this function.
    Returns whether the article is sports or business."""
    temp=heading+' '+body
    # strip colons, as in the training-time cleaning
    t=[w.replace(':','') for w in temp.split(' ')]
    data=[]
    for word,_ in words_dict:
        data.append(t.count(word))
    news_type=classifier.predict(np.array(data).reshape(1,-1))[0]
    if news_type==0:
        return 'sports'
    if news_type==1:
        return 'business'
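
The same vectorize-then-predict flow can be tried end to end on a tiny, self-contained example. Everything here is an illustrative assumption (the six-word `vocab`, the two training sentences, and the `vectorize` helper); it is a sketch of the idea, not the model trained above:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

vocab = ["match", "wicket", "runs", "stocks", "market", "oil"]


def vectorize(text):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.split(' ')
    return [tokens.count(word) for word in vocab]


train_texts = ["match wicket runs", "stocks market oil"]
train_labels = [0, 1]  # 0 = sports, 1 = business

model = MultinomialNB()
model.fit(np.array([vectorize(t) for t in train_texts]), np.array(train_labels))

# predict() expects a 2-D array: one row per article.
row = np.array(vectorize("runs runs match")).reshape(1, -1)
label = int(model.predict(row)[0])
prediction = {0: "sports", 1: "business"}[label]
print(prediction)  # sports
```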

In [57]:
# Note: the two variable names below are swapped: 'heading' holds the article text
# and 'body' holds the headline. checker() concatenates both, so the prediction is unaffected.
heading= """NORTH SOUND, Antigua: Virat Kohli continued his efficient and energetic accumulation of runs to bring up his maiden first-class double hundred, which took India past 400 in the first session of the second day.This was the first away double-century by an India captain. In a wicketless session that went for 102 runs, R Ashwin enjoyed some luck but also displayed some attractive shots; a back-foot punch for four through mid-on was the shot of the session. The partnership between the two swelled to 168.As was the case on day one, Gabriel bowled a short burst with the new ball, which was taken first thing in the morning. He tested Ashwin, who was dropped on 43 by Dowrich, who looked like he was already thinking of celebrating even before completing the simple catch. Jason Holder looked unremarkable at the other end; Carlos Brathwaite did a better job of testing the batsmen's patience with consistent bowling a set of stumps outside off stump.The tactic had kept India's top order quiet in the first session of the match, but Kohli didn't wait for too long before taking a calculated risk, which was executed so well it didn't look like a risk. Kohli has his own way of choosing what balls to drive. Each delivery of Brathwaite's first three overs was bowled to Kohli, who attempted to score off only one, the widest of the lot. It wasn't a half-volley either, but Kohli drove superbly on the up, and got a boundary to break any pressure the joining of dots creates.Cover driving, as usual, remained the feature of Kohli's innings. When he gave the treatment to Devendra Bishoo in the 105th over, the boundary took him past his previous best of 169; it was his 50th run through the covers. A sign of how well he batted came in how, in the 113th over, he played perhaps the only ungainly shot of his innings, a half-sweep across the line to deep midwicket. 
Turned out he had picked the rare wrong'un from Bishoo, and was actually playing with the spin.No Test double is easy, but in the last over before lunch, Kohli strolled to one of the more inevitable ones with an easy single off a short offbreak. Ashwin, who has two centuries against West Indies to his name, was now looking comfortable with 64 runs at a Test average of 67.80 against West Indies. 
"""
body="""Kohlis first double takes India past 400 against Windi
"""

In [59]:
news_type=checker(heading,body)

In [60]:
news_type

'sports'