## Detailed Article Explanation

The detailed code explanation for this article is available at the following link:

https://www.daniweb.com/programming/computer-science/tutorials/542539/qwen-vs-llama-who-is-winning-the-open-source-llm-race

For my other articles on Daniweb.com, please see this link:

https://www.daniweb.com/members/1235222/usmanmalik57

## Installing and Importing Required Libraries

In [None]:
!pip install huggingface_hub==0.24.7
!pip install rouge-score
!pip install --upgrade openpyxl
!pip install pandas scikit-learn



In [1]:
from huggingface_hub import InferenceClient
import os
import pandas as pd
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score
from collections import defaultdict



## Calling Qwen 2.5 and Llama 3.1 Using Hugging Face Inference API

In [2]:
hf_token = os.environ.get('HF_TOKEN') 

# Qwen 2.5 endpoint
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
qwen_model_client = InferenceClient(
    "Qwen/Qwen2.5-72B-Instruct",
    token=hf_token
)

# Llama 3.1 endpoint
# https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
llama_model_client = InferenceClient(
    "meta-llama/Llama-3.1-70B-Instruct",
    token=hf_token
)
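
The cell above reads the token with `os.environ.get`, which returns `None` when `HF_TOKEN` is not set, so a missing token would only show up later as a failed request. A minimal sketch of failing early instead; `require_env` is a hypothetical helper introduced here for illustration and is not part of `huggingface_hub`:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        # Covers both an unset variable and an empty string
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage (assumes HF_TOKEN was exported before starting the notebook):
# hf_token = require_env("HF_TOKEN")
```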


In [3]:
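def make_prediction(model, system_role, user_query):
    # Send the system instruction and the user query as a two-message chat
    response = model.chat_completion(
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_query},
        ],
        max_tokens=10,
    )
    # Return only the text of the first completion
    return response.choices[0].message.content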

In [4]:
system_role = "Assign positive, negative, or neutral sentiment to the movie review. Return only a single word in your response"
user_query = "I like this movie a lot"
make_prediction(qwen_model_client,
               system_role,
               user_query)

'positive'

In [10]:
system_role = "Assign positive, negative, or neutral sentiment to the movie review. Return only a single word in your response"
user_query = "I hate this movie a lot"
make_prediction(llama_model_client,
               system_role,
               user_query)

'Negative'
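
The two replies above differ in capitalization (`'positive'` vs `'Negative'`), which matters when predictions are later compared to lowercase labels by exact string match. A minimal sketch of a normalization step, assuming the reply starts with the label; `normalize_label` and `VALID_LABELS` are hypothetical names and are not used elsewhere in this notebook:

```python
VALID_LABELS = ("positive", "negative", "neutral")

def normalize_label(raw: str) -> str:
    """Map a raw model reply to one of the three labels, or 'unknown'."""
    cleaned = raw.strip().lower()
    for label in VALID_LABELS:
        # startswith tolerates trailing punctuation such as "positive."
        if cleaned.startswith(label):
            return label
    return "unknown"

print(normalize_label("Negative"))      # negative
print(normalize_label(" positive.\n"))  # positive
```

Replies that do not start with a known label are mapped to `'unknown'` rather than guessed, so they still count as errors when accuracy is computed.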

## Qwen 2.5-72b vs Llama 3.1-70b For Text Classification

In [6]:
## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"D:\Datasets\Tweets.csv")
dataset.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [7]:
# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())

airline_sentiment
neutral     34
positive    33
negative    33
Name: count, dtype: int64


In [8]:
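def predict_sentiment(model, system_role, user_query):
    # Same request shape as make_prediction: one system and one user message
    response = model.chat_completion(
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_query},
        ],
        max_tokens=10,
    )
    # Return only the text of the first completion
    return response.choices[0].message.content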
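
Calls to a hosted endpoint can fail transiently, and a failed call leaves that tweet without a prediction. One option is to retry a few times before giving up. The following is a minimal sketch under that assumption; `call_with_retries` and its parameters are hypothetical and not part of `huggingface_hub`:

```python
import time

def call_with_retries(func, *args, retries=3, base_delay=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Re-raise on the last attempt so the caller still sees the error
            if attempt == retries - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... between attempts
            time.sleep(base_delay * (2 ** attempt))

# Usage with the function defined above:
# sentiment = call_with_retries(predict_sentiment, model_client, system_role, user_query)
```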

In [9]:
models = {
    "qwen2.5-72b": qwen_model_client,
    "llama3.1-70b": llama_model_client
}

tweets_list = dataset["text"].tolist()
all_sentiments = []
exceptions = 0

for i, tweet in enumerate(tweets_list, 1):
    for model_name, model_client in models.items():
        try:
            print(f"Processing tweet {i} with model {model_name}")

            system_role = "You are an expert in annotating tweets with positive, negative, and neutral emotions"

            user_query = (
                f"What is the sentiment expressed in the following tweet about an airline? "
                f"Select sentiment value from positive, negative, or neutral. "
                f"Return only the sentiment value in small letters.\n\n"
                f"tweet: {tweet}"
            )

            sentiment_value = predict_sentiment(model_client, system_role, user_query)
            all_sentiments.append({
                'tweet_id': i,
                'model': model_name,
                'sentiment': sentiment_value
            })
            print(i, model_name, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred with model:", model_name, "| Tweet:", i, "| Error:", e)
            exceptions += 1

print("Total exception count:", exceptions)


Processing tweet 1 with model qwen2.5-72b
1 qwen2.5-72b neutral
Processing tweet 1 with model llama3.1-70b
1 llama3.1-70b neutral
Processing tweet 2 with model qwen2.5-72b
2 qwen2.5-72b neutral
Processing tweet 2 with model llama3.1-70b
2 llama3.1-70b neutral
Processing tweet 3 with model qwen2.5-72b
3 qwen2.5-72b neutral
Processing tweet 3 with model llama3.1-70b
3 llama3.1-70b neutral
Processing tweet 4 with model qwen2.5-72b
4 qwen2.5-72b neutral
Processing tweet 4 with model llama3.1-70b
4 llama3.1-70b neutral
Processing tweet 5 with model qwen2.5-72b
5 qwen2.5-72b neutral
Processing tweet 5 with model llama3.1-70b
5 llama3.1-70b neutral
Processing tweet 6 with model qwen2.5-72b
6 qwen2.5-72b negative
Processing tweet 6 with model llama3.1-70b
6 llama3.1-70b neutral
Processing tweet 7 with model qwen2.5-72b
7 qwen2.5-72b positive
Processing tweet 7 with model llama3.1-70b
7 llama3.1-70b neutral
Processing tweet 8 with model qwen2.5-72b
8 qwen2.5-72b neutral
Processing tweet 8 with 

In [11]:
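results_df = pd.DataFrame(all_sentiments)
for model_name in models.keys():
    model_results = results_df[results_df['model'] == model_name]
    # tweet_id is 1-based; use it to pick the matching ground-truth rows,
    # so predictions stay aligned even if some calls raised an exception
    row_positions = (model_results['tweet_id'] - 1).to_numpy()
    y_true = dataset["airline_sentiment"].iloc[row_positions]
    y_pred = model_results['sentiment']
    accuracy = accuracy_score(y_true, y_pred)
    print(f"Accuracy for {model_name}: {accuracy}")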

Accuracy for qwen2.5-72b: 0.8
Accuracy for llama3.1-70b: 0.77
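
A single accuracy number does not show which sentiment class a model struggles with. As a rough sketch, per-class recall (the fraction of each true label that was predicted correctly) can be computed with the `defaultdict` imported at the top of the notebook. The function name and the toy labels below are illustrative only and are not taken from the experiment:

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's examples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}

# Toy labels for illustration only
y_true = ["neutral", "neutral", "positive", "negative"]
y_pred = ["neutral", "negative", "positive", "negative"]
print(per_class_recall(y_true, y_pred))
# {'neutral': 0.5, 'positive': 1.0, 'negative': 1.0}
```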


## Qwen 2.5-72b vs Llama 3.1-70b For Text Summarization

In [12]:
# Dataset download link
# https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx

dataset = pd.read_excel(r"D:\Datasets\dataset.xlsx")
dataset = dataset.sample(frac=1)
print(dataset.shape)
dataset.head()

(1000, 10)


Unnamed: 0.1,Unnamed: 0,id,human_summary,publication,author,date,year,month,theme,content
883,259,18334,the new york times • in bel air the most expen...,New York Times,Mike McPhate,2017-02-06,2017.0,2.0,science,Good morning. (Want to get California Today by...
38,0,17333,Donald J. Trump on Tuesday named as his chief ...,New York Times,Binyamin Appelbaum,2017-01-04,2017.0,1.0,politics,WASHINGTON — Donald J. Trump on Tuesday n...
773,259,18209,Both the coal and rules were made final in the...,New York Times,Hiroko Tabuchi,2017-02-03,2017.0,2.0,politics,Republicans on Thursday took one of their firs...
806,259,18246,Due to some of the provocations out of north k...,New York Times,Michael R. Gordon and Motoko Rich,2017-02-04,2017.0,2.0,politics,TOKYO — Defense Secretary Jim Mattis assure...
862,259,18311,The rest of the money goes toward the fledglin...,New York Times,Clair MacDougall,2017-02-06,2017.0,2.0,crime,"MONROVIA, Liberia — Emmanuel Dongo, who spe..."


In [13]:
dataset['summary_length'] = dataset['human_summary'].apply(len)
average_length = dataset['summary_length'].mean()
print(f"Average length of summaries: {average_length:.2f} characters")

Average length of summaries: 1168.78 characters


In [14]:
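def generate_summary(model, system_role, user_query):
    # Same request shape as before, with a larger token budget for summaries
    response = model.chat_completion(
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_query},
        ],
        max_tokens=1200,
    )
    # Return only the text of the first completion
    return response.choices[0].message.content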

In [15]:
# Function to calculate ROUGE scores
def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}
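
To make the `rouge1` F-measure less of a black box, here is a simplified pure-Python version of the idea: count the words shared by the reference and the candidate, then combine precision and recall. This is only a sketch. It splits on whitespace and skips stemming, so its numbers will generally differ from those of `rouge_scorer`; `simple_rouge1_f` is a hypothetical name.

```python
from collections import Counter

def simple_rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F-measure with whitespace tokenization and no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = simple_rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # 0.833 (5 of 6 words overlap on each side)
```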

In [16]:
models = {"qwen2.5-72b": qwen_model_client,
          "llama3.1-70b": llama_model_client}

results = []

i = 0
for _, row in dataset[:20].iterrows():
    article = row['content']
    human_summary = row['human_summary']
    
    i = i + 1
    
    for model_name, model_client in models.items():
        
        print(f"Summarizing article {i} with model {model_name}")
        system_role = "You are an expert in creating summaries from text"
        user_query = f"Summarize the following article in 1150 characters. The summary should look like human created:\n\n{article}\n\nSummary:"
        
        generated_summary = generate_summary(model_client, system_role, user_query)
        rouge_scores = calculate_rouge(human_summary, generated_summary)
        
        results.append({
            'model': model_name,
            'article_id': row.id,
            'generated_summary': generated_summary,
            'rouge1': rouge_scores['rouge1'],
            'rouge2': rouge_scores['rouge2'],
            'rougeL': rouge_scores['rougeL']
        })

# Create a DataFrame with results
results_df = pd.DataFrame(results)

Summarizing article 1 with model qwen2.5-72b
Summarizing article 1 with model llama3.1-70b
Summarizing article 2 with model qwen2.5-72b
Summarizing article 2 with model llama3.1-70b
Summarizing article 3 with model qwen2.5-72b
Summarizing article 3 with model llama3.1-70b
Summarizing article 4 with model qwen2.5-72b
Summarizing article 4 with model llama3.1-70b
Summarizing article 5 with model qwen2.5-72b
Summarizing article 5 with model llama3.1-70b
Summarizing article 6 with model qwen2.5-72b
Summarizing article 6 with model llama3.1-70b
Summarizing article 7 with model qwen2.5-72b
Summarizing article 7 with model llama3.1-70b
Summarizing article 8 with model qwen2.5-72b
Summarizing article 8 with model llama3.1-70b
Summarizing article 9 with model qwen2.5-72b
Summarizing article 9 with model llama3.1-70b
Summarizing article 10 with model qwen2.5-72b
Summarizing article 10 with model llama3.1-70b
Summarizing article 11 with model qwen2.5-72b
Summarizing article 11 with model llama3.1

In [17]:
average_scores = results_df.groupby('model')[['rouge1', 'rouge2', 'rougeL']].mean()
average_scores_sorted = average_scores.sort_values(by='rouge1', ascending=False)
print("Average ROUGE scores by model:")
average_scores_sorted.head()

Average ROUGE scores by model:


model,rouge1,rouge2,rougeL
qwen2.5-72b,0.377589,0.096248,0.186228
llama3.1-70b,0.337821,0.082739,0.174995
