# Sample article
WASHINGTON — The United States is sending to Ukraine up to $400 million in additional military equipment and supplies, including four more medium-range rocket systems and ammunition, as the embattled nation tries to repel Russia’s advances in the Donbas region.
The four additional M142 High Mobility Artillery Rocket Systems, or HIMARS, will bring the total number sent to Ukraine to a dozen, a senior defense official told reporters in a briefing Friday. The official said the first eight HIMARS were particularly useful for Ukraine, as the fight in the Donbas has largely evolved into an artillery duel. The official refuted Russian reports that two of the delivered HIMARS were destroyed, and said all eight are accounted for and still in use by Ukraine.
The military equipment being drawn down from U.S. stockpiles and sent to Ukraine also includes three tactical vehicles, demolition munitions, counter-battery systems and spare parts, among other equipment, so Ukraine can repair and maintain other systems that allies have sent in recent months.
The shipment will also include 1,000 rounds of 155mm artillery ammunition, which the defense official described as a precision-guided type that would allow the Ukrainian military to better hit specific targets, which would save ammunition. The official would not confirm whether these shells will be the guided Excalibur artillery rounds, but said they have not been part of previous security assistance packages to Ukraine.
HIMARS is a light, wheeled multiple rocket launcher, which Pentagon officials previously said was a “top priority” request by Ukraine. The U.S. undersecretary for defense for policy, Colin Kahl, told reporters last month that HIMARS allows Ukrainian forces to strike targets with greater range and precision than other artillery weapons that were sent.
Before the U.S. agreed to provide the systems, Ukrainian President Volodymyr Zelenskyy formally promised to use HIMARS only for defensive purposes and to avoid firing into Russian territory, in order to avoid escalating the conflict.
The defense official said Russian claims HIMARS were used in strikes outside of Ukrainian territory are false, and that Russian forces, capabilities and logistics nodes within Ukraine are “absolutely fair targets.”
The official said the weekslong process to train Ukrainian troops on how to use the high-end HIMARS platform has been a limiting factor, and is why they were delivered in batches of four at a time. The official said efforts to train more Ukrainians on HIMARS will continue, but would not say how many have so far been trained.
The official said the HIMARS would arrive on the battlefield “rapidly,” but would not say how long their deployment might take.
The official said Russian forces are making “very incremental, limited, hard-fought, highly costly progress” in some parts of Donbas, and that they are far behind their timelines and objectives. The official would not specify where Russian forces are believed to have been disrupted, but said they are behind the front lines in Donbas.
Ukrainian forces are launching effective counteroffensives, the official said, and in the last week have started to use HIMARS strikes to seriously disrupt Russia’s ability to gain ground.
“We don’t see this at all as Russia winning this battle,” the official said. “Certainly they’re not winning it relative to their initial objectives. They’ve been very much thwarted, but the fighting is hard.”
The U.S. has been talking with allies and partners about other systems that could be sent to Ukraine, such as coastal defense capabilities, to move the nation away from Soviet legacy systems.
While Ukraine has received a great deal of equipment from the U.S. and other partner nations, the official said, its military has been using it at such an intense pace that forces need resources to repair and sustain those systems.
Providing this ability also sends Russia an important signal that Ukraine will be able to continue the fight, the official said.
“If the Russians think they can outlast the Ukrainians, they need to rethink that,” the official said. “We are already pivoting towards thinking about what the Ukrainians will need in the months and years ahead.”

In [361]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
nltk.download('averaged_perceptron_tagger') #need this on first run
from nltk.corpus import wordnet

orig_stopWords = set(stopwords.words("english"))
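#extra stopwords; 'Ha' is presumably what 'has' turns into once categoriser() lemmatizes and capitalises it
stopWords_list = [',', '.', "'s", "mr", "-", '``', "''", "said", '$', 'month', 'months', 'year', 'years', 'date', 'dates', 'official', 'Ha']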
stopWords = orig_stopWords.union(stopWords_list)

from sklearn.feature_extraction.text import TfidfVectorizer


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/timothywee/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [362]:
def summariser(text):

    #tokenize text
    words = word_tokenize(text)

    #Creating a dictionary of words and their frequencies
    freqTable = dict()
    for word in words:
        word = word.lower() #convert to lowercase
        if word in stopWords: #remove stopwords
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    global sortedFreqTable #made it global because we will likely need it down the road
    sortedFreqTable = sorted(freqTable.items(), key=lambda x: x[1], reverse=True)

    sentences = sent_tokenize(text) #split into sentences
    global sentenceValue
    sentenceValue = dict()
    
    for sentence in sentences:
        #tokenize and lowercase the sentence so that whole words are matched; a substring test would also match 'us' inside 'russia'
        sentence_words = set(word.lower() for word in word_tokenize(sentence))
        for word, freq in freqTable.items():
            if word in sentence_words:
                #add the word's frequency to the sentence's value. Higher value = more important words = more important sentence
                sentenceValue[sentence] = sentenceValue.get(sentence, 0) + freq

    sumValues = 0
    for sentence in sentenceValue:
        sumValues += sentenceValue[sentence] #sum the values
    
    # Average value of a sentence from the original text (0 if no sentence was scored, which avoids a division by zero)
    average = int(sumValues / len(sentenceValue)) if sentenceValue else 0
    
    #adds the first sentence (usually the topic sentence in a news article) into the summary, followed by ones with high value scores. Optimised for news articles, not for general text.
    global summary 
    summary = sentences[0]
    for sentence in sentences[1:]:
        if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.3 * average)):
            summary += " " + sentence

    return summary #, sortedFreqTable
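
The scoring rule used by `summariser` can be tried on a tiny input without NLTK. The following is a minimal sketch under stated assumptions, not the notebook's implementation: the names `STOP`, `tokenize`, `score_sentences` and `summarise` are hypothetical, the stop list is a toy one, and regular expressions stand in for `word_tokenize` and `sent_tokenize`.

```python
import re
from collections import Counter

# Hypothetical toy stop list; the notebook uses NLTK's English list plus extras.
STOP = {"the", "a", "to", "of", "and", "is", "in", "are"}

def tokenize(text):
    """Lowercase word tokens; a simple regex stand-in for nltk.word_tokenize."""
    return re.findall(r"[a-z0-9']+", text.lower())

def score_sentences(text):
    """Score each sentence by summing the document frequency of its distinct non-stop words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for w in tokenize(text) if w not in STOP)
    scores = [sum(freq[w] for w in set(tokenize(s)) if w not in STOP) for s in sentences]
    return sentences, scores

def summarise(text, factor=1.3):
    """Keep the first sentence, then every sentence scoring above factor * average."""
    sentences, scores = score_sentences(text)
    average = int(sum(scores) / len(scores))
    kept = [sentences[0]]
    kept += [s for s, v in zip(sentences[1:], scores[1:]) if v > factor * average]
    return " ".join(kept)

demo = ("Rockets are sent to Ukraine. "
        "Rockets and rocket launchers help Ukraine hold ground. "
        "The weather is mild.")

print(score_sentences(demo)[1])  # one score per sentence
print(summarise(demo))

# Substring tests over-match: 'us' is found inside 'russia', but not in its token list.
print("us" in "russia", "us" in tokenize("russia"))
```

Because the threshold is a multiple of the average score, the number of sentences kept depends on how unevenly the frequent words are spread across the text.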

In [363]:
#categoriser function
#input: full article text 
#output: list of nouns and their frequencies, which can be used to categorise the article

import re

def categoriser(input_text):
    # Replace 'US' and 'U.S.' with 'USA'. This is so that the POS tagger doesn't get confused by the word 'us' (collective term), and tags 'USA' as a noun.
    # Because referring to the country is done in full caps, we can distinguish from 'us'.
    # Word boundaries stop the replacement from touching other words, e.g. an existing 'USA' would otherwise become 'USAA'.
    input_text = re.sub(r'\bUS\b', 'USA', input_text)
    input_text = re.sub(r'\bU\.S\.?', 'USA', input_text)

    #replace 'United States' with 'USA'. This is to avoid the POS tagger tagging 'United' and 'States' as two different words.
    input_text = input_text.replace('United States', 'USA')
    
    #lemmatize input_text 
    words = word_tokenize(input_text)
    lemmatized_words = []
    for word in words:
        lemmatized_words.append(lemmatizer.lemmatize(word))
    input_text = ' '.join(lemmatized_words)

    #perform part-of-speech (POS) tagging on the input text, to identify nouns
    pos_tagged = nltk.pos_tag(word_tokenize(input_text.lower()))

    #convert nltk POS tags to wordnet POS tags
    def pos_tagger(nltk_tag): #identifies nouns only
        if nltk_tag.startswith('N'):
            return wordnet.NOUN
        else:         
            return None

    #Mapping the pos_tagger function to the pos_tagged list. The pos_tagger function returns the wordnet.NOUN if the nltk_tag starts with 'N', otherwise it returns None.
    wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))

    #create array with only nouns
    noun_list = []
    for i in range(len(wordnet_tagged)):
        if wordnet_tagged[i][1] == wordnet.NOUN:
            noun_list.append(wordnet_tagged[i][0].capitalize()) #13 July 2022: capitalise category names

    #make the terms 'HIMARS', 'UAV' and 'MANPAD' fully capitalised in noun_list, mapping plural forms to the singular.
    #Whole-word lookup is used because chained str.replace calls turn 'Uavs' into 'UAVs' before the plural rule can match.
    acronyms = {'Himars': 'HIMARS', 'Uav': 'UAV', 'Uavs': 'UAV', 'Manpad': 'MANPAD', 'Manpads': 'MANPAD'}
    noun_list = [acronyms.get(x, x) for x in noun_list]

    #sort noun_list by frequency
    noun_freq = nltk.FreqDist(noun_list)
    sorted_noun_freq = sorted(noun_freq.items(), key=lambda x: x[1], reverse=True)

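    #keep only nouns that are not stopwords, occur at least twice, are alphabetic (drops punctuation) and are longer than 1 character (drops "I", etc.)
    #the nouns were capitalised above while stopWords is lowercase, so the lowercase form is checked as well
    sorted_noun_freq = [x for x in sorted_noun_freq if x[0] not in stopWords and x[0].lower() not in stopWords and x[1] > 1 and x[0].isalpha() and len(x[0]) > 1]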

    return sorted_noun_freq
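
The country-name normalisation at the top of `categoriser` is sensitive to how the match is made. The sketch below compares plain substring replacement with a word-boundary regex on a made-up sentence; it uses only the standard library. `normalise_country` and `sample` are hypothetical names for illustration, and the choice to swallow the trailing period of 'U.S.' is an assumption of this sketch.

```python
import re

def normalise_country(text):
    """Map 'United States', 'U.S.' and 'US' to 'USA' without touching other words."""
    text = re.sub(r"\bUnited States\b", "USA", text)
    # The optional trailing period of 'U.S.' is consumed, so no stray '.' is left behind.
    text = re.sub(r"\bU\.S\.?", "USA", text)
    # \b on both sides keeps 'USA' and 'BUS' intact.
    text = re.sub(r"\bUS\b", "USA", text)
    return text

sample = "The US and the USA sent aid; the U.S. BUS arrived."

# Plain substring replacement rewrites every 'US', including the ones inside other words.
naive = sample.replace("US", "USA")

print(naive)
print(normalise_country(sample))
```

Consuming the period means a sentence-final 'U.S.' loses its full stop, which is harmless for noun counting but would matter if the text were split into sentences afterwards.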

In [364]:
#backup 12 Jul; USE THE ONE ABOVE

#categoriser function
#input: full article text 
#output: list of nouns and their frequencies, which can be used to categorise the article

def backup_categoriser(input_text):
    tokenised_text = word_tokenize(input_text)
    #lowercase tokenised_text 
    tokenised_text = [word.lower() for word in tokenised_text]

    #remove 'has' from tokenised_text. Maybe do the pos tagging before tokenising.
    tokenised_text = [word for word in tokenised_text if word != 'has']

    #lemmatize tokenised_text
    #tokenised_text = [lemmatizer.lemmatize(word) for word in tokenised_text]

    pos_tagged = nltk.pos_tag(tokenised_text)

    def pos_tagger(nltk_tag): #identifies nouns only
        # if nltk_tag.startswith('J'):
        #     return wordnet.ADJ
        # elif nltk_tag.startswith('V'):
        #     return wordnet.VERB
        if nltk_tag.startswith('N'):
            return wordnet.NOUN
        # elif nltk_tag.startswith('R'):
        #     return wordnet.ADV
        else:         
            return None

    wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))

    #create array with only nouns
    noun_list = []
    for i in range(len(wordnet_tagged)):
        if wordnet_tagged[i][1] == wordnet.NOUN:
            noun_list.append(wordnet_tagged[i][0])

    #sort noun_list by frequency
    noun_freq = nltk.FreqDist(noun_list)
    sorted_noun_freq = sorted(noun_freq.items(), key=lambda x: x[1], reverse=True)

    #remove nouns with frequency 1
    sorted_noun_freq = [x for x in sorted_noun_freq if x[1] > 1 and x[0].isalpha() and len(x[0]) > 1]

    #remove nouns in StopWords
    sorted_noun_freq = [x for x in sorted_noun_freq if x[0] not in stopWords] 

    return sorted_noun_freq

In [365]:
original_text = input("input text here:")

print(summariser(original_text))
print(categoriser(original_text)[:7]) #print first 7 entries in categoriser; can print more


LONDON/FRANKFURT (REUTERS) - The biggest single pipeline carrying Russian gas to Germany started annual maintenance on Monday (July 11), with flows expected to stop for 10 days, but governments, markets and companies are worried the shutdown might be extended due to war in Ukraine. The Nord Stream 1 pipeline transports 55 billion cubic metres (bcm) a year of gas from Russia to Germany under the Baltic Sea. Europe fears Russia may extend the scheduled maintenance to restrict European gas supply further, throwing plans to fill storage for winter into disarray and heightening a gas crisis that has prompted emergency measures from governments and painfully high bills for consumers. German Economy Minister Robert Habeck has said the country should confront the possibility that Russia will suspend gas flows through Nord Stream 1 beyond the scheduled maintenance period. There are other big pipelines from Russia to Europe but flows have been gradually declining, especially after Ukraine halted

In [None]:
import nltk, string
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()

d1 = input("input D1 here:")
d2 = input("input D2 here:")
# d3 = input("input D3 here:") #to input more documents
# d4 = input("input D4 here:") #to input more documents
documents = [d1, d2] #d3, d4]

nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')

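def cos_similarity(textlist):
    #TfidfVectorizer L2-normalises each row by default, so multiplying the matrix by its transpose gives the pairwise cosine similarity
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()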

cos_similarity(documents)
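
To see why the matrix product in `cos_similarity` yields cosine similarities, the computation can be written out by hand. This is a rough standard-library sketch, assuming whitespace tokenisation and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 row normalisation; `tfidf_vectors`, `cosine_matrix` and `docs` are hypothetical names, and the numbers will not match the notebook's lemmatised, stopword-filtered pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build smoothed TF-IDF rows and scale each row to unit L2 length."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n = len(docs)
    df = {t: sum(t in toks for toks in tokenised) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows

def cosine_matrix(rows):
    """Dot products of unit-length rows are cosine similarities."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

docs = [
    "rockets sent to ukraine",
    "ukraine receives rockets",
    "gas pipeline maintenance",
]
sim = cosine_matrix(tfidf_vectors(docs))
for row in sim:
    print([round(v, 3) for v in row])
```

The diagonal is 1 because every document is identical to itself, and documents with no shared terms score 0.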

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

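#Initialise BART zero-shot model (the checkpoint is BART fine-tuned on MNLI, not BERT)
checkpoint = "facebook/bart-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
#pass the loaded model and tokenizer to the pipeline; otherwise they go unused and the pipeline loads its own default
classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)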

In [391]:
original_text = input("input text here:")


#BART ZeroShot categorizer 
#input: full article text
#output: list of categories and their confidence scores 


labels = ['Tank', 'Artillery', 'UAV', 'Fighter Aircraft', 'Helicopter', 'Missile', 'MANPAD', 'Infrastructure'] #edit candidate labels here

results = classifier(original_text, candidate_labels=labels)

#print the top 5 labels; each score is shown relative to the top label's score, so the first line is always 1.0
top_score = results['scores'][0]
for label, score in zip(results['labels'][:5], results['scores'][:5]):
    print(label, round(score / top_score, 5))

Artillery 1.0
Missile 0.27611
Tank 0.23053
Infrastructure 0.19795
MANPAD 0.1238


In [374]:
#print top category and confidence score
print(results['labels'][0], results['scores'][0])

Helicopter 0.18488530814647675
