# Prompt Engineering With Dolly
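Dolly is an open-source, instruction-following language model developed by Databricks. The Dolly v2 checkpoints are fine-tuned variants of EleutherAI's Pythia models, which use the GPT-NeoX architecture (the checkpoint loaded below reports itself as `GPTNeoXForCausalLM`). Like GPT-3, it is designed to generate human-like text based on the input it is given. The base model was pretrained on a diverse range of internet text, and Dolly cannot tell you which specific documents or sources it was trained on.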

**How Dolly Works:**
Dolly generates text by repeatedly predicting the next token (a word or word fragment) given everything that came before it. It uses the context of the input to make these predictions. For example, given the input "The sky is", it might predict "blue" as the next token. It continues generating tokens until it reaches a specified length or produces an end-of-sequence token (`<|endoftext|>` for the tokenizer used here).
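
The loop described above can be sketched with a toy lookup table standing in for the model. The table, its probabilities, and the `greedy_generate` helper are all made up for illustration; a real model scores every token in its vocabulary at each step.

```python
# Toy next-token table: maps the previous token to candidate next tokens
# with made-up probabilities (illustrative only, not taken from Dolly).
TOY_MODEL = {
    "The": {"sky": 0.6, "sea": 0.4},
    "sky": {"is": 0.9, "was": 0.1},
    "is": {"blue": 0.7, "grey": 0.3},
    "blue": {"<|endoftext|>": 1.0},
}
EOS = "<|endoftext|>"


def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Append the most likely next token until EOS or the length cap is hit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = TOY_MODEL.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == EOS:
            break  # end-of-sequence: stop generating
        tokens.append(next_token)
    return tokens


print(" ".join(greedy_generate(["The", "sky", "is"])))
```

Always picking the single most likely token is called greedy decoding. Sampling from the candidate distribution instead is what makes repeated runs produce different text.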

**Pros of Dolly:**
1. **Versatility:** Dolly can generate a wide variety of content, from stories and poems to technical reports and code.
2. **Coherency:** The text generated by Dolly is often coherent and grammatically correct, making it useful for a wide range of applications.
3. **Customizability:** You can guide Dolly's output by carefully crafting the input. For example, you can ask it to generate a story in the style of a specific author, or to write a poem about a specific topic.

**Cons of Dolly:**
1. **Lack of Understanding:** Dolly doesn't understand the text it generates. It doesn't have beliefs or desires. It generates text based on patterns it learned during training.
2. **Inappropriate Content:** Dolly can sometimes generate inappropriate or biased content, as it learns from the data it was trained on, which includes the biases present in that data.
3. **Unpredictability:** It can be hard to predict what Dolly will generate, especially for longer pieces of text. It might not always generate the content you want, and might require multiple attempts to get a satisfactory output.


In [162]:
#imports
import warnings
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import re
import pandas as pd
import os
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None) 

In [4]:
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(r"C:\Users\samuel.cannon\OneDrive - LinQuest Corporation\Desktop\decorations-generator\decorations-generator\dolly", padding_side='left')
model = AutoModelForCausalLM.from_pretrained(r"C:\Users\samuel.cannon\OneDrive - LinQuest Corporation\Desktop\decorations-generator\decorations-generator\dolly")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
model

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50280, 2560)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=2560, out_features=7680, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=2560, out_features=10240, bias=True)
          (dense_4h_to_h): Linear(in_features=10240, out_features=2560, bias=True)


In [6]:
tokenizer

GPTNeoXTokenizerFast(name_or_path='C:\Users\samuel.cannon\OneDrive - LinQuest Corporation\Desktop\decorations-generator\decorations-generator\dolly', vocab_size=50254, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['### End', '### Instruction:', '### Response:']}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|padding|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	50254: AddedToken("                        ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50255: AddedToken("                       ", rstrip=False, lstrip=False, single_word=False, norm
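
The tokenizer output above lists `### Instruction:`, `### Response:` and `### End` as additional special tokens, which suggests the model was fine-tuned on prompts wrapped in those markers. Below is a minimal sketch of building such a prompt. The helper names are hypothetical, and the introductory sentence in `INTRO_BLURB` is an assumption based on the format Databricks published with Dolly v2; check the model card before relying on the exact wording.

```python
# Marker strings taken from the tokenizer's additional_special_tokens above.
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"

# Assumed intro wording; verify against the model card.
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_dolly_prompt(instruction: str) -> str:
    """Wrap a plain instruction in the instruction/response markers."""
    return (
        f"{INTRO_BLURB}\n\n"
        f"{INSTRUCTION_KEY}\n{instruction.strip()}\n\n"
        f"{RESPONSE_KEY}\n"
    )


def extract_response(generated: str) -> str:
    """Keep the text after the response marker and before the end marker."""
    _, found, tail = generated.partition(RESPONSE_KEY)
    if not found:
        tail = generated  # marker missing: fall back to the whole text
    return tail.split(END_KEY, 1)[0].strip()


print(build_dolly_prompt("how large is the moon?"))
```

If the output is decoded with `skip_special_tokens=True`, these markers are likely to be stripped from the text, so `extract_response` would need the output decoded with `skip_special_tokens=False`.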

In [10]:
# Define the input text
input_text = "how large is the moon?"

# Encode the input text
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate a response
output = model.generate(input_ids, max_length=30, temperature=0.2, num_return_sequences=1)

# Decode the response
response = tokenizer.decode(output[0], skip_special_tokens=True)

print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


how large is the moon?

The moon is about one-third the size of the earth.

The moon is about one-third
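
The log above warns that no attention mask or pad token id was passed. In addition, `temperature` only has an effect when sampling is enabled with `do_sample=True`; otherwise `generate` decodes greedily. A minimal sketch of assembling the arguments follows; `build_generate_kwargs` is a hypothetical helper, not part of `transformers`.

```python
def build_generate_kwargs(encoded, pad_token_id, max_new_tokens=30, temperature=None):
    """Collect keyword arguments for model.generate().

    `encoded` is the mapping returned by calling the tokenizer directly,
    which contains both `input_ids` and `attention_mask`.
    """
    kwargs = {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],  # silences the mask warning
        "pad_token_id": pad_token_id,                 # silences the pad-token notice
        "max_new_tokens": max_new_tokens,             # counts generated tokens only
    }
    if temperature is not None:
        # Temperature is only applied when sampling is switched on.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    return kwargs


# Usage with the objects loaded earlier (not executed here):
# encoded = tokenizer(input_text, return_tensors="pt")
# output = model.generate(
#     **build_generate_kwargs(encoded, tokenizer.eos_token_id, temperature=0.2)
# )
```

Using `max_new_tokens` instead of `max_length` keeps the budget for the answer independent of how long the prompt is.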


In [11]:
#open data file and read in contents 
with open('_AIData_JSON.txt', 'r') as file:
    contents = file.read()
    
print(contents)


{
    "ID": "001",
    "RANK": "2LT",
    "YEAR": "2016",
    "DEC": "AFAM",
    "ANOMALY": 1,
    "NARRATIVE": "During this time, the outstanding professional skill, leadership and ceaseless efforts of Lieutenant XXX resulted in significant contributions to the effectiveness and success of the Air Force's only Undergraduate Combat Systems Officer training program.  Additionally, he administered 140 practice evaluation tests in Air Education and Training Command's busiest Standardization and Evaluations Flight, pivotal to an overall 98 percent test average.  furthermore, Lieutenant XXX inspected 163 squadron flight evaluation folders and tracked 372 written exams, ensuring a 100 percent on time completion rate and identifying 183 discrepancies.  Finally, he compiled T-1A takeoff & landing data of 64 runway configurations, which streamlined student training.",
    "TYPE1": "TR",
    "EVAL1": [
        "2Lt XXX is being eliminated from UCT due to Lack of Adaptability.",
        "2LT XXX 

In [12]:
# Pulls each chunk of text between curly braces into a list
chunks = re.findall(r'\{([^}]+)\}', contents)

df = pd.DataFrame(chunks, columns=['text'])
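
The brace-matching regex above cuts a chunk short if a narrative ever contains a `}` character. Since the printed records look like JSON objects, an alternative is to let the `json` module find the object boundaries. This is a sketch under the assumption that the file is a sequence of valid JSON objects separated by whitespace or commas; the sample records are invented, and `iter_json_objects` is a hypothetical helper.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON object found in `text`, skipping separators."""
    decoder = json.JSONDecoder()
    idx = 0
    while True:
        # Jump to the next opening brace; stop when none remain.
        idx = text.find("{", idx)
        if idx == -1:
            return
        # raw_decode parses one value and reports where it ended.
        obj, idx = decoder.raw_decode(text, idx)
        yield obj


# Invented sample data in the same shape as the printed record.
sample = '''
{"ID": "001", "RANK": "2LT", "NARRATIVE": "Compiled data {draft} for review.", "EVAL1": ["First bullet.", "Second bullet."]}
{"ID": "002", "RANK": "SSG", "NARRATIVE": "Second record.", "EVAL1": ["Only bullet."]}
'''

records = list(iter_json_objects(sample))
print(len(records), records[0]["EVAL1"])
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(records)`, which gives named `NARRATIVE` and `EVAL1` columns without further string cleaning.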

In [13]:
# Remove unwanted characters from strings
data = [''.join(x).replace("'", '').replace('"', '').replace(':', '').replace('\n', '') for x in df['text']]

# Create the dataframe of narratives
eval_df = pd.DataFrame(data).rename(columns={0:'evaluation'})
narrative_df = pd.DataFrame(data).rename(columns={0:'narrative'})

# Extract the evaluation data from the 'evaluation' column
eval_data = eval_df['evaluation'].str.extract(r'EVAL(.*)', expand=False).str.strip()
# Extract the data between 'NARRATIVE' and 'EVAL' in each row of the dataframe
narrative_df['narrative'] = narrative_df['narrative'].str.extract(r'NARRATIVE(.*?)EVAL', expand=False).str.strip()


# Create a new dataframe with the extracted evaluation data
eval_df = pd.DataFrame(eval_data)
narrative_df = pd.DataFrame(narrative_df['narrative'])

# Extract the text between brackets in the "evaluation" column
eval_df['evaluation'] = eval_df['evaluation'].str.extract(r'\[(.*?)\]')


In [14]:
df_combined = pd.concat([eval_df, narrative_df], axis=1)

# Replace dashes with commas
df_combined = df_combined.replace('-', ',', regex=True)

# Remove commas at the beginning of rows
df_combined = df_combined.apply(lambda x: x.str.lstrip(','))

In [10]:
prompt = """
you are responsible for creating a positive summary of the person who accomplished the following tasks outlined in the text below, summarize the text in a manner that reads like an overview:\n
 ,Definitely a superstar and top,shelf performer,
,excelled at all assigned tasks and exceeded all expectations, ,Deployed as Third Country National security escort in support of OPERATION ENDURING FREEDOM ,
 ,Supervised more than 300 multinational personnel supporting 12.9 million dollar project,,zero incidents ,
   ,Noteworthy performance helped clinch success of largest Air Combat Command construction project,
     ,Single,handedly combined base publications and records managements paper and electronic official files ,
       ,Tremendously enhanced customer service support,,reduced document search and review time by 60%, 
       ,Assisted in the restructure and upgrade of the wing information management (IM) deployment resources, 
       ,Improved IM Function readiness and ability to deploy assets within a 24,hour notice without a hitch,
         ,Laudable efforts guaranteed the on,time delivery of official mail and parcels without one discrepancy, 
         ,Served as an entry control monitor for critical telephone switch facility following the September 11 attacks,
           ,Prevented security violations and ensured 100 % accountability of facility personnel and equipment,
             ,Strong effort to sharpen skills,,completed more than 80 computer based training modules in three months,
               ,Expertly published weekly official bulletins,,first class product kept entire wing informed,,promote now!, 
               ,Assisted Workgroup Management Program Office in revamping its computer and network training center ,
                 ,Set up computers, ran cables, loaded software,,increased network certification capability 66 percent,
                   ,Meticulously sorted over 6,000 x,rays for submission to Defense Reutilization and Marketing Office,
                     ,Recycling zeal enabled silver to be reused,,saved tax dollars and provided additional Air Force revenue,
                       ,Air Force ambassador,,participated in many on and off,base activities during SJAFB Honor Guard duty,
                         ,An airman of standards,,strong work ethic and impeccable dress and appearance,,promote below,the,zone!	
"""

# Encode the prompt with attention_mask
input_ids = tokenizer(prompt, return_tensors='pt', padding='max_length', truncation=True, max_length=800, return_token_type_ids=False, return_attention_mask=True)

# Generate the output
output = model.generate(input_ids['input_ids'], attention_mask=input_ids['attention_mask'], max_length=800, temperature=.5, max_new_tokens=200)

# Decode the output
response = tokenizer.decode(output[0], skip_special_tokens=True)

#remove all commas 
response = response.replace(',', '')

print(response)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Both `max_new_tokens` (=200) and `max_length`(=800) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



you are responsible for creating a positive summary of the person who accomplished the following tasks outlined in the text below summarize the text in a manner that reads like an overview:

Definitely a superstar and topshelf performer
excelled at all assigned tasks and exceeded all expectationsDeployed as Third Country National security escort in support of OPERATION ENDURING FREEDOM
Supervised more than 300 multinational personnel supporting 12.9 million dollar projectzero incidents
  Noteworthy performance helped clinch success of largest Air Combat Command construction project
    Singlehandedly combined base publications and records managements paper and electronic official files
      Tremendously enhanced customer service supportreduced document search and review time by 60% 
      Assisted in the restructure and upgrade of the wing information management (IM) deployment resources 
      Improved IM Function readiness and ability to deploy assets within a 24hour notice without

In [15]:
df_combined.head(2)

Unnamed: 0,evaluation,narrative
0,"2Lt XXX is being eliminated from UCT due to Lack of Adaptability., 2LT XXX bearing, appearance, and physical fitness have been excellent during this course and while awaiting reassignment. His professionalism will be a valuable asset for any unit to which he is","During this time, the outstanding professional skill, leadership and ceaseless efforts of Lieutenant XXX resulted in significant contributions to the effectiveness and success of the Air Forces only Undergraduate Combat Systems Officer training program. Additionally, he administered 140 practice evaluation tests in Air Education and Training Commands busiest Standardization and Evaluations Flight, pivotal to an overall 98 percent test average. furthermore, Lieutenant XXX inspected 163 squadron flight evaluation folders and tracked 372 written exams, ensuring a 100 percent on time completion rate and identifying 183 discrepancies. Finally, he compiled T,1A takeoff & landing data of 64 runway configurations, which streamlined student training., TYPE1 TR,"
1,", Directs spt to 6K active duty/guard/reserve/retirees/dependents at Air Forces only Joint Specialized UPT base, , Delivers procedural guidance on assignment and formal training processes to 18 group/squadron commanders, , Audits/updates MilPDS for integrity; counsels Airmen on SGLI program; customer service rep/issues ID cards, , Trains/leads deployment functions; manages Airmen in,processing actions/force shaping & retraining programs, , Located/Uploaded 10 missing evals for six promotion boards...assured mbrs rec rdy for AFPC brd w/no delay, , #1 vRED accuracy rate in AETC; scrutinized 1.3K personnel records/revised address format...99.7% accurate, , Led monthly MilPDS data review/refined process; 92% reduction in sys rejects...certified wings data integrity, , Reorganized customer service workflow; reduced reqd visits...cut customer processing and wait,time by 50%, , Spearheaded 386 Automated Records/495 leave trans; rejects reduced by 35%; zero repeats; 38 man,hrs saved, , Implemented auto RIP delivery sys; daily products now sent directly to units; saved MPS $3.6K/300 man,hours, , Finished 60 hrs of DCAPES trng; 10 hrs of search & recovery; selected to fill joint tasking in spt OCO mission, , #1/22 team mbrs selected to attended 80 hour formal honor guard tng class; only 1 of 2 wg honor guard trainers, , Elite Silver Talon Honor Guard member; worked 18 on/off duty details...90 hours covering the tri,state region, , 1st in AETC to update 60 officer duty history/promotion board discrepancies; 3 month task completed in 2 wks, , Efficiently authorized $2M in benefits/issued 954 CACs/dep IDs; cut customer wait by 47%...avg 8 min vs 15, , Superb leader/team player; SSgt selection well deserved; first choice for NCOIC, Force Management Section!","During this period, Sergeant XXX superbly supported Operation ENDURING FREEDOM by ensuring the accountability and processing of United States and coalition forces. 
Sergeant XXXX executed the reception operations for 315,000 personnel ensuring warfighters were properly briefed and processed. His oversit of 3,200 off base request trips with 7,000 force protection status checks ensured the safety and security of wing personnel. Additionally, Sergeant XXX was instrumental in processing 26 reenlistment bonuses, resulting in 310,000 dollars in tax free entitlements for Air Force personnel. Furthermore, he participated in the Koi Tash Noncommissioned Officer as part of Theater Security Cooperation Russian and English language exchange enhancing the United States and Kyrgyz Republic partnership. Finally, his exemplary performance led to his team garnering the 376th Expeditionary Mission Support Group Team of the Month Award for November 2011., TYPE1 EPR,"


## Framing the Summarization Problem as a Question & Answering Problem Instead
When it comes to summarizing information, Dolly can certainly do the job. However, there are advantages to framing the summarization task as a question-answering problem instead.

1. **Specificity**: By asking a question, you can guide Dolly to focus on specific aspects of the information. This can help in generating more relevant summaries.

2. **Contextual Understanding**: Dolly conditions every response on the full prompt. A question that embeds the source text gives the model an explicit target to answer against, which can keep the summary closer to that context.

3. **Interactive**: The question-answering approach is more interactive. You can ask follow-up questions based on Dolly's responses to get deeper insights.

4. **Control**: Questions give you more control over the output. You can ask different types of questions (what, why, how, etc.) to get different perspectives on the information.

In contrast, instructing Dolly to summarize something might result in a more general summary that does not focus on the aspects you're interested in. Therefore, using Dolly for question answering can often be a more effective approach for summarizing information.


In [180]:
# Define the input text
# could we change the approach of the project into a question answering task instead of summarization?
input_text = "my boss has tasked me with writing a job review, how could I summarize the specific information contained in ticks 'John Directed spt to 6K active duty/guard/reserve/retirees/dependents at Air Forces only Joint Specialized UPT base, , John then Delivered procedural guidance on assignment and formal training processes to 18 group/squadron commanders' to create a glowing job review for an individual?"
# Encode the input text
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate a response
output = model.generate(input_ids, max_length=200, temperature=0.1, num_return_sequences=1)

# Decode the response
response = tokenizer.decode(output[0], skip_special_tokens=True)

print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


my boss has tasked me with writing a job review, how could I summarize the specific information contained in ticks 'John Directed spt to 6K active duty/guard/reserve/retirees/dependents at Air Forces only Joint Specialized UPT base,, John then Delivered procedural guidance on assignment and formal training processes to 18 group/squadron commanders' to create a glowing job review for an individual? 
The following are the key points I would like to highlight:

John is a highly regarded leader who has a proven track record of success in building strong working relationships and delivering results for our organization. John has a proven track record of effectively leading complex, multi-disciplinary teams to deliver results for our organization. John has a proven track record of building and leading high performing teams in a fast-paced, dynamic environment. John has a proven track record of effectively leading and developing junior leaders. John has a proven track record of effectively le

In [181]:
# Define the input text
# could we change the approach of the project into a question answering task instead of summarization?
input_text = "my boss has tasked me with writing a job review, how could I summarize the specific information contained in ticks 'John Audits/updates MilPDS for integrity; counsels Airmen on SGLI program; customer service rep/issues ID cards, , He Trains/leads deployment functions; manages Airmen in,processing actions/force shaping & retraining programs' to create a glowing job review for an individual?"
# Encode the input text
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate a response
output = model.generate(input_ids, max_length=200, temperature=0.1, num_return_sequences=1)

# Decode the response
response = tokenizer.decode(output[0], skip_special_tokens=True)

print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


my boss has tasked me with writing a job review, how could I summarize the specific information contained in ticks 'John Audits/updates MilPDS for integrity; counsels Airmen on SGLI program; customer service rep/issues ID cards,, He Trains/leads deployment functions; manages Airmen in,processing actions/force shaping & retraining programs' to create a glowing job review for an individual?

A:

You could use the following structure:

John has been an integral part of MilPDS for integrity for the past 5 years, where he has been responsible for counselling Airmen on the SGLI program and managing Airmen in the processing actions force shaping and retraining programs.

John has also been a customer service representative for the past 5 years, where he has assisted Airmen with their ID cards.




## Chaining Models for Batches of Evaluation Bullet Points
Splitting the bullet points into short prompts and running them as a batch has several advantages over feeding Dolly a single, long prompt:

1. **Efficiency**: Batched inference lets the model process several inputs in one forward pass, which makes better use of parallel hardware such as GPUs than running the inputs one after another.

2. **Sequence Length**: A long prompt may exceed the model's maximum sequence length (the maximum number of tokens that the model can handle in a single input). The prompt then has to be truncated or split into smaller parts, which can lose context and degrade the generated output. Short per-bullet prompts stay well under that limit.

3. **Quality of Output**: When processing a single long prompt, the model might lose track of earlier context as it generates output, especially if the prompt approaches the model's maximum sequence length. Shorter inputs give the model less to keep track of and can produce higher quality output.

4. **Speed**: On hardware that supports parallel processing, a batch usually finishes sooner than the same prompts run one at a time, because the fixed overhead of each call is shared across the inputs.

Therefore, for large-scale tasks, a batch of short prompts is generally a better choice than a single, long prompt. Note that the `generate_reviews` function below loops over the prompts and calls `generate` once per prompt, so it gets the benefits of shorter prompts but not the parallelism of a true batch.
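
A true batch passes several prompts to the tokenizer and to `generate` in one call. The sketch below shows one way to do that. `chunked` and `generate_batch` are hypothetical helpers, and the sketch assumes the tokenizer pads on the left (as configured when it was loaded above), which decoder-only models need for batched generation.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_batch(prompts, model, tokenizer, max_new_tokens=200):
    """Run a single generate() call over a list of prompts.

    Sketch only: requires a loaded model and a left-padding tokenizer.
    """
    # padding=True pads every prompt to the longest one in this batch.
    encoded = tokenizer(prompts, return_tensors="pt", padding=True)
    output = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
    )
    # One decoded string per prompt, in the same order as the input list.
    return tokenizer.batch_decode(output, skip_special_tokens=True)


# Usage with the objects loaded earlier (not executed here):
# for batch in chunked(all_prompts, batch_size=4):
#     reviews.extend(generate_batch(batch, model, tokenizer))

print(list(chunked(["a", "b", "c", "d", "e"], 2)))
```

The batch size is a trade-off: larger batches use the hardware better but need more memory, so a small value is a reasonable starting point on a CPU or a small GPU.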

In [192]:
def generate_reviews(ticks_list, model, tokenizer):
    """
    Generate job reviews based on a list of ticks.

    Args:
        ticks_list (list): A list of ticks containing specific information.
        model: The model used for generating the reviews.
        tokenizer: The tokenizer used for encoding the input text.

    Returns:
        pandas.DataFrame: A DataFrame containing the generated job reviews.
    """

    # Suppress warnings
    warnings.filterwarnings("ignore")

    # Initialize an empty list to store the reviews
    reviews = []

    # Iterate over the ticks
    for phrase in ticks_list:

        # Format the input text
        input_text = f"my boss has tasked me with writing a job review, how could I summarize the specific information contained in ticks '{phrase}' to create a glowing job review for an individual?"

        # Encode the input text
        input_ids = tokenizer.encode(input_text, return_tensors="pt")

        # Generate a response
        output = model.generate(input_ids, max_length=250, temperature=0.1, num_return_sequences=1)

        # Decode the response
        response = tokenizer.decode(output[0], skip_special_tokens=True)

        # Append the response to the reviews list
        reviews.append(response)

    # Convert the reviews list to a pandas DataFrame and return it
    return pd.DataFrame(reviews, columns=['Review'])


# Usage:
ticks_list = ["John Audits/updates MilPDS for integrity; counsels Airmen on SGLI program; customer service rep/issues ID cards, , He Trains/leads deployment functions; manages Airmen in processing actions/force shaping & retraining programs",
              "John Directed spt to 6K active duty/guard/reserve/retirees/dependents at Air Forces only Joint Specialized UPT base, , Delivered procedural guidance on assignment and formal training processes to 18 group/squadron commanders"]

df = generate_reviews(ticks_list, model, tokenizer)

df

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Unnamed: 0,Review
0,"my boss has tasked me with writing a job review, how could I summarize the specific information contained in ticks 'John Audits/updates MilPDS for integrity; counsels Airmen on SGLI program; customer service rep/issues ID cards,, He Trains/leads deployment functions; manages Airmen in processing actions/force shaping & retraining programs' to create a glowing job review for an individual?\n\nA:\n\nYou could use the following structure:\n\nJohn has been an integral part of the MilPDS team for many years, and has played a key role in the success of the organization. His contributions have helped to ensure the integrity of the SGLI program, and has helped to ensure that Airmen receive the best customer service possible. John has also been a strong advocate for the use of ID cards, and has helped to ensure that all Airmen receive the necessary documentation to participate in the SGLI program. John has also been a strong advocate for the use of force shaping and retraining programs, and has helped to ensure that all Airmen receive the necessary support to succeed in these programs. John has also been a strong advocate"
1,"my boss has tasked me with writing a job review, how could I summarize the specific information contained in ticks 'John Directed spt to 6K active duty/guard/reserve/retirees/dependents at Air Forces only Joint Specialized UPT base,, Delivered procedural guidance on assignment and formal training processes to 18 group/squadron commanders' to create a glowing job review for an individual?\n\nA:\n\nI would summarize the specific information contained in the ticks by using bullet points.\n\nJohn Directed spt to 6K active duty/guard/reserve/retirees/dependents at Air Forces only Joint Specialized UPT base,\n\nThis could be summarized as follows:\n\nJohn directed Specialist to conduct UPT for 6,000 active duty, guard, reserve, and retirees, and their dependents at Air Forces only Joint Specialized UPT base.\n\nDelivered procedural guidance on assignment and formal training processes to 18 group/squadron commanders.\n\nThese 18 group/squadron commanders received guidance on how to conduct UPT for 18,000 active duty, guard, reserve, and retirees, and their dependents at Air"


In [205]:
def process_reviews(df):
    """
    Process the reviews in the given DataFrame.

    Args:
        df (pandas.DataFrame): The DataFrame containing the reviews.

    Returns:
        pandas.DataFrame: The processed DataFrame with the reviews.

    This function performs several operations on the reviews in the DataFrame:
    1. Removes all text occurring before the first question mark, ensuring that the prompt is not returned.
    2. Removes new line characters from each row in the 'Review' column.
    3. Removes text that occurs after the final period, to avoid partial sentences.
    4. Removes text that occurs prior to the final colon.
    5. Removes duplicated sentences in each review.
    6. Concatenates the rows in the 'Review' column to form one row.

    The processed DataFrame is returned with the updated 'Review' column.

    """

    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Remove all text occurring before the first question mark, ensuring we are not returning the prompt
    df['Review'] = df['Review'].apply(lambda x: x.split('?', 1)[-1] if '?' in x else x)

    # Remove new line characters from each row in df['Review']
    df['Review'] = df['Review'].str.replace('\n', ' ')

    # Remove text that occurs after the final period - do not want partial sentences
    df['Review'] = df['Review'].apply(lambda x: x.rsplit('.', 1)[0] + '.')

    #remove text that occurs prior to final colon
    df['Review'] = df['Review'].apply(lambda x: x.split(':')[-1].strip() if ':' in x else x)


    #remove duplicated sentences in review
    def remove_duplicate_sentences(review):
        # Split the review into sentences
        sentences = re.split('(?<=[.!?]) +', review)
        # Initialize an empty list to store the unique sentences
        unique_sentences = []
        # Iterate over the sentences
        for sentence in sentences:
            # If the sentence is not in the list of unique sentences, add it
            if sentence not in unique_sentences:
                unique_sentences.append(sentence)
        # Join the unique sentences back together and return the result
        return ' '.join(unique_sentences)

    # Apply the function to each review
    df['Review'] = df['Review'].apply(remove_duplicate_sentences)

    # Concatenate the rows in df['Review'] together to form one row
    df_concatenated = pd.DataFrame([df['Review'].str.cat(sep=' ')], columns=['Review'])
    
    return df_concatenated

process_reviews(df)

Unnamed: 0,Review
0,"John has been an integral part of the MilPDS team for many years, and has played a key role in the success of the organization. His contributions have helped to ensure the integrity of the SGLI program, and has helped to ensure that Airmen receive the best customer service possible. John has also been a strong advocate for the use of ID cards, and has helped to ensure that all Airmen receive the necessary documentation to participate in the SGLI program. John has also been a strong advocate for the use of force shaping and retraining programs, and has helped to ensure that all Airmen receive the necessary support to succeed in these programs. John directed Specialist to conduct UPT for 6,000 active duty, guard, reserve, and retirees, and their dependents at Air Forces only Joint Specialized UPT base. Delivered procedural guidance on assignment and formal training processes to 18 group/squadron commanders."
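
Splitting on the first question mark removes the echoed prompt only as long as no bullet point contains a `?`. For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so an alternative is to drop the prompt by position before decoding. `strip_prompt` is a hypothetical helper, and the token ids below are made up.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the newly generated token ids (everything after the prompt)."""
    return output_ids[prompt_length:]


# Made-up token ids: the first three stand for the prompt, the rest for the answer.
prompt_ids = [101, 102, 103]
output_ids = prompt_ids + [7, 8, 9]
print(strip_prompt(output_ids, len(prompt_ids)))

# Usage with the objects from the cells above (not executed here):
# new_tokens = strip_prompt(output[0], input_ids.shape[-1])
# response = tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because slicing works the same way on lists and tensors, the helper can be applied directly to `output[0]`.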
