## Detailed Article Explanation

The detailed code explanation for this article is available at the following link:

https://www.daniweb.com/programming/computer-science/tutorials/542463/text-classification-and-summarization-with-qwen-2-5-model-from-hugging-face

My other Daniweb.com articles are available at this link:

https://www.daniweb.com/members/1235222/usmanmalik57

## Installing and Importing Required Libraries

In [3]:
!pip install rouge-score pandas
!pip install --upgrade openpyxl



In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
from sklearn.metrics import accuracy_score
from rouge_score import rouge_scorer

## A Basic Example of Using the Qwen 2.5 Instruct Model from Hugging Face

### Importing the Model and Tokenizer from Hugging Face

In [29]:

model_name = "Qwen/Qwen2.5-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)



### Generating a Response from the Qwen 2.5 Model

In [30]:

system_prompt = "You are an expert Python coder"

user_prompt = "Give me a Python recursive function calculate factorial of a number"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)


<|im_start|>system
You are an expert Python coder<|im_end|>
<|im_start|>user
Give me a Python recursive function calculate factorial of a number<|im_end|>
<|im_start|>assistant
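
The printed prompt above follows the ChatML layout: each message is wrapped in `<|im_start|>{role}` and `<|im_end|>` markers, and `add_generation_prompt=True` appends an open assistant turn. The sketch below reproduces that layout by hand for plain system and user messages. The helper name `to_chatml` is hypothetical, and this is only an approximation: the real template ships with the tokenizer and may handle additional cases that this sketch ignores.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the ChatML layout printed above (hypothetical helper)."""
    parts = []
    for message in messages:
        # Each turn: start marker + role, the content, then the end marker
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from this point
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are an expert Python coder"},
    {"role": "user", "content": "Give me a Python recursive function calculate factorial of a number"},
]

prompt = to_chatml(messages)
print(prompt)
```

In practice `tokenizer.apply_chat_template` should remain the source of truth; the sketch is only meant to make the printed format easier to read.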



In [31]:
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)

Certainly! The factorial of a non-negative integer \( n \) is the product of all positive integers less than or equal to \( n \). It is denoted by \( n! \) and is defined as:

\[ n! = n \times (n-1) \times (n-2) \times \ldots \times 1 \]

For example, \( 5! = 5 \times 4 \times 3 \times 2 \times 1 = 120 \).

A recursive function to calculate the factorial can be defined as follows:

\[ n! = \begin{cases} 
1 & \text{if } n = 0 \\
n \times (n-1)! & \text{if } n > 0 
\end{cases} \]

Here's the Python code for this recursive function:

```python
def factorial(n):
    # Base case: if n is 0, return 1
    if n == 0:
        return 1
    # Recursive case: n * factorial(n - 1)
    else:
        return n * factorial(n - 1)

# Example usage:
print(factorial(5))  # Output should be 120
```

This function works by repeatedly calling itself with decreasing values of \( n \) until it reaches the base case where \( n \) is 0. At that point, it starts returning values back up the call stack, multiplyin

In [32]:
def factorial(n):
    # Base case: factorial of 0 or 1 is 1
    if n == 0 or n == 1:
        return 1
    # Recursive case: n * factorial of (n-1)
    else:
        return n * factorial(n-1)

# Example usage:
number = 6
print(f"The factorial of {number} is {factorial(number)}")

The factorial of 6 is 720
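
The list comprehension in the generation cell (`output_ids[len(input_ids):]`) is there because, for a causal language model, `model.generate` returns the prompt tokens followed by the newly generated tokens. Slicing by the prompt length keeps only the new tokens, so the decoded response does not repeat the prompt. A minimal sketch of the same slicing with plain Python lists, where the token IDs are made up for illustration:

```python
# Toy token IDs standing in for real tensors (values are made up)
prompt_ids = [[101, 7, 8, 9]]                   # one prompt of 4 tokens
full_output_ids = [[101, 7, 8, 9, 42, 43, 2]]   # the prompt followed by 3 new tokens

# Drop the first len(prompt) tokens from each output sequence
new_token_ids = [
    output[len(prompt):] for prompt, output in zip(prompt_ids, full_output_ids)
]

print(new_token_ids)  # [[42, 43, 2]]
```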


## Text Classification with Qwen 2.5

### Importing and Preprocessing the Dataset

In [33]:
## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"/content/Tweets.csv")
dataset.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [34]:
# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())


airline_sentiment
neutral     34
positive    33
negative    33
Name: count, dtype: int64
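
Because `sample(n=...)` is called without `random_state`, each run of the cell above draws a different subset of tweets, so the accuracy reported later can vary between runs. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; the helper name `balanced_sample`, the seed value, and the toy DataFrame are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd


def balanced_sample(df, label_col, counts, seed=42):
    """Draw a fixed number of rows per label, reproducibly (hypothetical helper)."""
    parts = [
        df[df[label_col] == label].sample(n=n, random_state=seed)
        for label, n in counts.items()
    ]
    return pd.concat(parts).reset_index(drop=True)


# Toy data standing in for Tweets.csv
toy = pd.DataFrame({
    "airline_sentiment": ["neutral"] * 5 + ["positive"] * 5 + ["negative"] * 5,
    "text": [f"tweet {i}" for i in range(15)],
})

counts = {"neutral": 2, "positive": 2, "negative": 2}
sample_a = balanced_sample(toy, "airline_sentiment", counts)
sample_b = balanced_sample(toy, "airline_sentiment", counts)

print(len(sample_a))              # 6
print(sample_a.equals(sample_b))  # True: the same rows are drawn on every call
```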


### Predicting Tweet Sentiment with Qwen 2.5

In [35]:
def generate_model_response(system_prompt, user_prompt):

  messages = [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": user_prompt}
  ]

  text = tokenizer.apply_chat_template(
      messages,
      tokenize=False,
      add_generation_prompt=True
  )
  model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

  generated_ids = model.generate(
      **model_inputs,
      max_new_tokens=1048

  )
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

  return response

In [36]:


def find_sentiment(dataset):

    tweets_list = dataset["text"].tolist()

    all_sentiments = []
    exceptions = 0

    system_prompt = "You are an expert in annotating tweets with positive, negative, and neutral emotions"

    for i, tweet in enumerate(tweets_list, start=1):

        try:
            user_prompt = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            sentiment_value = generate_model_response(system_prompt, user_prompt)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)
            exceptions += 1
            # Record a placeholder so predictions stay aligned with the labels
            # and a repeatable failure cannot retry the same tweet forever
            sentiment_value = "error"

        all_sentiments.append(sentiment_value)
        print(i, sentiment_value)

    print("Total exception count:", exceptions)
    # accuracy_score expects (y_true, y_pred)
    accuracy = accuracy_score(dataset["airline_sentiment"], all_sentiments)
    print("Accuracy:", accuracy)

find_sentiment(dataset)

1 neutral
2 neutral
3 neutral
4 neutral
5 neutral
6 neutral
7 neutral
8 neutral
9 negative
10 neutral
11 neutral
12 neutral
13 neutral
14 neutral
15 negative
16 neutral
17 neutral
18 neutral
19 neutral
20 neutral
21 positive
22 neutral
23 neutral
24 neutral
25 positive
26 neutral
27 neutral
28 neutral
29 neutral
30 neutral
31 neutral
32 negative
33 neutral
34 neutral
35 neutral
36 positive
37 positive
38 neutral
39 positive
40 neutral
41 positive
42 positive
43 positive
44 positive
45 positive
46 positive
47 positive
48 positive
49 neutral
50 neutral
51 neutral
52 positive
53 positive
54 positive
55 positive
56 positive
57 positive
58 neutral
59 positive
60 neutral
61 mixed
62 positive
63 positive
64 positive
65 positive
66 positive
67 positive
68 neutral
69 negative
70 negative
71 negative
72 neutral
73 negative
74 negative
75 neutral
76 negative
77 negative
78 negative
79 negative
80 negative
81 negative
82 neutral
83 neutral
84 negative
85 negative
86 negative
87 negative
88 negativ
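
The output above includes a label (`mixed`) that is not one of the three requested values, and `accuracy_score` compares strings exactly, so any stray whitespace, capitalization, or extra wording in a reply also counts as a miss. One possible way to handle this is to normalize each reply before scoring. The sketch below is an assumption about how that could be done; `normalize_label`, `ALLOWED_LABELS`, and the `"unknown"` fallback are hypothetical names, not part of the original notebook.

```python
ALLOWED_LABELS = ("positive", "negative", "neutral")


def normalize_label(raw_response, fallback="unknown"):
    """Map a free-form model reply onto one of the allowed labels (hypothetical helper)."""
    cleaned = raw_response.strip().lower().strip(".!\"'")
    if cleaned in ALLOWED_LABELS:
        return cleaned
    # Accept a longer reply only if it mentions exactly one allowed label
    mentioned = [label for label in ALLOWED_LABELS if label in cleaned]
    return mentioned[0] if len(mentioned) == 1 else fallback


for reply in ["neutral", " Positive.", "mixed", "The sentiment is negative"]:
    print(repr(reply), "->", normalize_label(reply))
```

Replies that map to the fallback still count as errors in the accuracy calculation, but they become easy to count and inspect separately.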

## Text Summarization with Qwen 2.5

### Importing the Dataset


In [37]:
# Dataset download link (GitHub)
# https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx

dataset = pd.read_excel(r"/content/dataset.xlsx")
dataset = dataset.sample(frac=1)
print(dataset.shape)
dataset.head()

(1000, 10)


Unnamed: 0.1,Unnamed: 0,id,human_summary,publication,author,date,year,month,theme,content
94,0,17400,Or the benefits of cities themselves have impr...,New York Times,Emily Badger,2017-01-12,2017.0,1.0,business,Everyone has theories for why professional...
160,0,17481,"In reality, though, making safes is a hushed t...",New York Times,Michael Wilson,2017-01-10,2017.0,1.0,business,They read like descriptions of props from the ...
498,259,17880,President Trump upended America’s traditional...,New York Times,Peter Baker,2017-01-24,2017.0,1.0,politics,WASHINGTON — President Trump upended Americ...
674,259,18072,Mr. Trump appeared to have more pleasant excha...,New York Times,Charles McDermid,2017-01-29,2017.0,1.0,politics,Good morning. Here’s what you need to know: •...
129,0,17445,Dr. Yamanaka’s method is now routinely used to...,New York Times,Nicholas Wade,2017-04-14,2017.0,4.0,science,"At the Salk Institute in La Jolla, Calif. scie..."


### Summarizing News Articles with Qwen 2.5

In [38]:
# Function to calculate ROUGE scores
def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}
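
To make the `fmeasure` values returned by `calculate_rouge` easier to interpret, here is a simplified, hand-computed version of ROUGE-1: count the unigrams shared by the reference and the candidate, derive precision and recall from that overlap, and combine them into an F-measure. This is a sketch under simplifying assumptions (lowercasing and whitespace tokenization only, no stemming), so its numbers will not necessarily match `rouge_score`; the function name `rouge1_f` is hypothetical.

```python
from collections import Counter


def rouge1_f(reference, candidate):
    """Simplified ROUGE-1 F-measure based on whitespace tokens (hypothetical helper)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())  # share of candidate words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on a mat"

# 4 shared words out of 6 in each text: precision = recall = 4/6
print(round(rouge1_f(reference, candidate), 3))  # 0.667
```

ROUGE-2 applies the same idea to word pairs, and ROUGE-L uses the longest common subsequence instead of n-gram counts.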

In [39]:
# Function to generate summaries with the Qwen 2.5 model and score them with ROUGE
def generate_summary(dataset):

    results = []

    i = 0

    for _, row in dataset[:20].iterrows():
      article = row['content']
      human_summary = row['human_summary']

      i +=1
      print(f"Summarizing article {i}")

      system_prompt = "You are an expert in summarizing news articles"
      user_prompt = f"Summarize the following article in 1150 characters. The summary should look like human created:\n\n{article}\n\nSummary:"

      generated_summary = generate_model_response(system_prompt, user_prompt)

      rouge_scores = calculate_rouge(human_summary, generated_summary)

      results.append({
          'article_id': row.id,
          'generated_summary': generated_summary,
          'rouge1': rouge_scores['rouge1'],
          'rouge2': rouge_scores['rouge2'],
          'rougeL': rouge_scores['rougeL']
      })

    return results

results = generate_summary(dataset)

results_df = pd.DataFrame(results)

mean_values = results_df[["rouge1", "rouge2", "rougeL"]].mean()
print(mean_values)

Summarizing article 1
Summarizing article 2
Summarizing article 3
Summarizing article 4
Summarizing article 5
Summarizing article 6
Summarizing article 7
Summarizing article 8
Summarizing article 9
Summarizing article 10
Summarizing article 11
Summarizing article 12
Summarizing article 13
Summarizing article 14
Summarizing article 15
Summarizing article 16
Summarizing article 17
Summarizing article 18
Summarizing article 19
Summarizing article 20
rouge1    0.325830
rouge2    0.068624
rougeL    0.168639
dtype: float64
