# TfidfVectorizer Explanation
`TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features.

TF-IDF combines two quantities: TF, the term frequency, and IDF, the inverse document frequency.

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
text = ['Hello ShiyamK here, I love machine learning','Welcome to the Machine learning hub' ]

In [40]:
vect = TfidfVectorizer()

In [41]:
vect.fit(text)

In [42]:
# TF counts how often a word occurs in each document; IDF down-weights words that occur in many documents.
print(vect.idf_)

[1.40546511 1.40546511 1.40546511 1.         1.40546511 1.
 1.40546511 1.40546511 1.40546511 1.40546511]


In [43]:
print(vect.vocabulary_)

{'hello': 0, 'shiyamk': 6, 'here': 1, 'love': 4, 'machine': 5, 'learning': 3, 'welcome': 9, 'to': 8, 'the': 7, 'hub': 2}


### A word that is present in every document gets a low IDF value, so rarer, more distinctive words stand out with the highest IDF values.
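
The two distinct values in `vect.idf_` can be reproduced by hand. This is a minimal sketch that assumes scikit-learn's default smoothed formula, `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The helper name `smooth_idf` is made up for this illustration.

```python
import math

n_docs = 2  # the toy corpus has two sentences

def smooth_idf(doc_freq, n_docs):
    # Smoothed IDF as used by TfidfVectorizer with its default settings (assumed here)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_rare = smooth_idf(1, n_docs)    # a word found in only one sentence, e.g. 'hello'
idf_common = smooth_idf(2, n_docs)  # 'machine' and 'learning' occur in both sentences

print(round(idf_rare, 8), idf_common)  # 1.40546511 1.0
```

Indices 3 and 5 in the vocabulary are 'learning' and 'machine', which is why those two positions hold the minimum value of 1.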

In [44]:
example = text[0]
example

'Hello ShiyamK here, I love machine learning'

In [45]:
example = vect.transform([example])
print(example.toarray())

[[0.44665616 0.44665616 0.         0.31779954 0.44665616 0.31779954
  0.44665616 0.         0.         0.        ]]


### Here, a 0 marks a vocabulary word (by index) that does not occur in the given sentence.
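
The non-zero entries of that row can also be checked by hand. The sketch below assumes the vectorizer's defaults: each word's weight is `tf * idf`, the row is then scaled to unit L2 length, and the one-letter token "I" is dropped by the default tokenizer. The variable names are illustrative only.

```python
import math

idf_rare = math.log(3 / 2) + 1   # words found in one sentence only
idf_common = 1.0                 # 'machine' and 'learning', found in both

# First sentence: 'hello', 'shiyamk', 'here', 'love' (rare) plus
# 'machine', 'learning' (common); every word occurs once, so tf = 1.
raw = [idf_rare] * 4 + [idf_common] * 2

# L2 normalisation: divide each weight by the Euclidean length of the row
norm = math.sqrt(sum(v * v for v in raw))
weights = [v / norm for v in raw]

print(round(weights[0], 4), round(weights[-1], 4))  # 0.4467 0.3178
```

The four rare words share the larger weight and the two common words share the smaller one, matching the array printed above.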

## PassiveAggressiveClassifier

### Passive: if the classification is correct, keep the model unchanged; Aggressive: if the classification is incorrect, update the model to fit this misclassified example.

Passive-Aggressive algorithms are generally used for large-scale learning. They belong to the small family of "online-learning algorithms". In online machine learning, the input data arrives in sequential order and the model is updated step by step, as opposed to batch learning, where the entire training dataset is used at once. This is very useful when there is a huge amount of data and training on the entire dataset at once is computationally infeasible because of its sheer size. Put simply, an online-learning algorithm gets a training example, updates the classifier, and then throws the example away.
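
To make the passive and aggressive steps concrete, here is a minimal sketch of the textbook passive-aggressive update for a binary linear classifier with labels -1 and +1. It is an illustration under stated assumptions, not scikit-learn's implementation: there is no regularisation constant and no bias term, and the function name `pa_update` and the toy stream are invented for the example. Note that this version stays passive only when the example is classified correctly with a margin of at least 1.

```python
def pa_update(w, x, y):
    """One online step: return the (possibly updated) weight vector."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)           # hinge loss on this example
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return w                            # passive: keep the model as it is
    tau = loss / norm_sq                    # aggressive: smallest step that fixes the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# Examples arrive one at a time and are discarded after the update
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1)]
w = [0.0, 0.0]
for x, y in stream:
    w = pa_update(w, x, y)

print(w)  # [1.0, -1.0]
```

The first two examples trigger aggressive updates; the third is already classified with enough margin, so the weights are left untouched.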

## Let's start the work

In [46]:
import pandas as pd

In [47]:
dataframe = pd.read_csv('news.csv')
dataframe.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [48]:
x = dataframe['text']
y = dataframe['label']

In [49]:
x

0       Daniel Greenfield, a Shillman Journalism Fello...
1       Google Pinterest Digg Linkedin Reddit Stumbleu...
2       U.S. Secretary of State John F. Kerry said Mon...
3       — Kaydee King (@KaydeeKing) November 9, 2016 T...
4       It's primary day in New York and front-runners...
                              ...                        
6330    The State Department told the Republican Natio...
6331    The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...
6332     Anti-Trump Protesters Are Tools of the Oligar...
6333    ADDIS ABABA, Ethiopia —President Obama convene...
6334    Jeb Bush Is Suddenly Attacking Trump. Here's W...
Name: text, Length: 6335, dtype: object

In [50]:
y

0       FAKE
1       FAKE
2       REAL
3       FAKE
4       REAL
        ... 
6330    REAL
6331    FAKE
6332    FAKE
6333    REAL
6334    REAL
Name: label, Length: 6335, dtype: object

In [51]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [52]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
y_train

2402    REAL
1922    REAL
3475    FAKE
6197    REAL
4748    FAKE
        ... 
4931    REAL
3264    REAL
1653    FAKE
2607    FAKE
2732    REAL
Name: label, Length: 5068, dtype: object

In [54]:
tfvect = TfidfVectorizer(stop_words='english',max_df=0.7)
tfid_x_train = tfvect.fit_transform(x_train)
tfid_x_test = tfvect.transform(x_test)

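* A float `max_df` is a proportion: max_df = 0.50 means "ignore terms that appear in more than 50% of the documents", so the `max_df=0.7` used above drops terms found in more than 70% of the training documents.
* An integer `max_df` is an absolute count: max_df = 25 means "ignore terms that appear in more than 25 documents".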
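
The effect of `max_df` is easy to see on a tiny corpus. In this sketch the four documents are invented for illustration; "the" appears in 3 of 4 documents (75%), which is above a 0.7 threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'the' occurs in three of the four documents
docs = ['the cat sat', 'the dog ran', 'the bird flew', 'one cat ran']

default_vocab = TfidfVectorizer().fit(docs).vocabulary_            # max_df defaults to 1.0
capped_vocab = TfidfVectorizer(max_df=0.7).fit(docs).vocabulary_   # drop terms in > 70% of docs

print('the' in default_vocab, 'the' in capped_vocab)  # True False
```

Words such as "cat" and "ran", which appear in only half of the documents, stay in the vocabulary in both cases.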

In [55]:
classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(tfid_x_train,y_train)

In [56]:
y_pred = classifier.predict(tfid_x_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 93.21%


In [57]:
cf = confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
print(cf)

[[569  46]
 [ 40 612]]
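
The accuracy reported earlier can be read straight off this matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the order given by `labels=['FAKE','REAL']`; the precision and recall lines are an extra illustration, not something computed in the notebook.

```python
# Confusion matrix copied from the output above
cf = [[569, 46],
      [40, 612]]

total = sum(sum(row) for row in cf)
accuracy = (cf[0][0] + cf[1][1]) / total           # correct predictions on the diagonal

fake_recall = cf[0][0] / sum(cf[0])                # true FAKE articles that were caught
fake_precision = cf[0][0] / (cf[0][0] + cf[1][0])  # predicted FAKE that really were FAKE

print(f'Accuracy: {round(accuracy * 100, 2)}%')    # Accuracy: 93.21%
print(round(fake_recall, 3), round(fake_precision, 3))
```

So 46 fake articles were labelled REAL and 40 real articles were labelled FAKE, out of 1267 test articles.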


In [58]:
def fake_news_det(news):
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = classifier.predict(vectorized_input_data)
    print(prediction)

In [59]:
fake_news_det('U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.')

['REAL']


In [60]:
fake_news_det("""Go to Article 
President Barack Obama has been campaigning hard for the woman who is supposedly going to extend his legacy four more years. The only problem with stumping for Hillary Clinton, however, is she’s not exactly a candidate easy to get too enthused about.  """)

['FAKE']


In [61]:
import pickle
# save the trained classifier to disk; the with-block closes the file afterwards
with open('model.pkl', 'wb') as f:
    pickle.dump(classifier, f)

In [62]:
# load the model from disk
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
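
Note that `model.pkl` holds only the classifier, so the loaded model is usable only while the fitted `tfvect` is still in memory. One possible way around this, sketched below as an assumption rather than as part of the original workflow, is to bundle the vectorizer and classifier in a scikit-learn pipeline and pickle that single object. The toy texts and the file name `pipeline.pkl` are made up for the example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Toy training data, purely illustrative (not taken from news.csv)
texts = ['officials confirmed the report', 'shocking secret they hide',
         'the minister said on monday', 'you will not believe this trick']
labels = ['REAL', 'FAKE', 'REAL', 'FAKE']

# The pipeline vectorizes raw text and then classifies it
pipeline = make_pipeline(TfidfVectorizer(),
                         PassiveAggressiveClassifier(max_iter=50, random_state=0))
pipeline.fit(texts, labels)

# Persist vectorizer and classifier together in one file
path = os.path.join(tempfile.mkdtemp(), 'pipeline.pkl')
with open(path, 'wb') as f:
    pickle.dump(pipeline, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored object accepts raw strings directly
print(restored.predict(['the minister confirmed the report']))
```

With this layout a fresh session needs only the pickle file to make predictions on raw text.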

In [63]:
def fake_news_det1(news):
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = loaded_model.predict(vectorized_input_data)
    print(prediction)

In [64]:
fake_news_det1("""Go to Article 
President Barack Obama has been campaigning hard for the woman who is supposedly going to extend his legacy four more years. The only problem with stumping for Hillary Clinton, however, is she’s not exactly a candidate easy to get too enthused about.  """)

['FAKE']


In [65]:
fake_news_det1("""U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.""")

['REAL']


In [66]:
fake_news_det('''Terms of the tender stipulated that the contract be signed by June 1. As per the report, the contract will pay $1.8 billion for supplying the Vande Bharat trains and $2.5 billion for their maintenance for 35 years. Considering the indexation, the total value of the contract could be as much as $6.5 billion.

Addressing a press conference, Transmashholding's Chief Executive Officer (CEO) Kirill Lipa said, "The decision has been made, but the document itself has not been signed, (it will be signed) within 45 days from March 29." 

TMH took part in the tender in a consortium with the Indian state construction and engineering company RVNL. The Russian firm beat out proposals by companies such as Alstom, Stadler, Siemens, and a consortium of local producers led by Titagarh and BHEL, which bid the second-lowest.

The report further said that the production of the trains will be localised at the Indian Railways, Marathwada Rail Coach Factory, in Latur, Maharashtra. The delivery of the trains is expected to take place between 2026 to 2030. But the first two prototypes will be ready for trials by the end of 2025. ''')

['REAL']


In [67]:
fake_news_det('''ChatGPT is a natural language processing tool driven by AI technology that allows you to have human-like conversations and much more with the chatbot. The language model can answer questions and assist you with tasks, such as composing emails, essays, and code.''')

['FAKE']


In [68]:
fake_news_det('''"ChatGPT is scary good. We are not far from dangerously strong AI," said Elon Musk, who was one of the founders of OpenAI before leaving. Sam Altman, OpenAI's chief, said on Twitter that ChatGPT had more than one million users in the first five days after it launched. 

According to analysis by Swiss bank UBS, ChatGPT is the fastest-growing app of all time. The analysis estimates that ChatGPT had 100 million active users in January, only two months after its launch. For comparison, it took nine months for TikTok to reach 100 million users.''')

['FAKE']


In [70]:
fake_news_det('''Ten days after he appointed new campaign leadership, Donald Trump and many of his closest aides and allies remain divided on whether to adopt more mainstream stances or stick with the hard-line conservative positions at the core of his candidacy, according to people involved in the discussions.

Trump has been flooded with conflicting advice about where to land, with the tensions vividly illustrated this week as the GOP nominee publicly wrestled with himself on the details of his signature issue: immigration.''')

['REAL']


In [71]:
fake_news_det('''ROME – If anyone suspected that Pope Francis didn’t really mean the strong words he spoke on religious freedom last week in the United States – that he was phoning it in, while his real concerns were elsewhere – claims that he held a private meeting with Kentucky county clerk Kim Davis certainly should lay that suspicion to rest.

The meeting was first reported by Robert Moynihan of Inside the Vatican magazine. A Vatican spokesman said Wednesday, “I do not deny that the meeting took place, but I have no comments to add,” which, in effect, is a way of allowing the report to stand.''')

['REAL']
