# Installations


This cell installs six Python packages with pip: tiktoken, cohere, openai, langchain, chromadb, and unstructured. tiktoken counts tokens using OpenAI's tokenizers; openai and cohere are client libraries for the OpenAI and Cohere APIs; langchain is a framework for building applications on top of language models; chromadb is a vector database for storing embeddings; and unstructured parses documents such as PDFs and HTML into plain text. The -q flag suppresses most of pip's output.

In [None]:
!pip install -q tiktoken
!pip install -q cohere
!pip install -q openai
!pip install -q langchain
!pip install -q chromadb
!pip install -q unstructured

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires openai, which is not installed.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires openai, which is not installed.
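Because the install output can include dependency warnings, it may help to confirm which packages actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib import metadata

# Packages installed by the cell above
PACKAGES = ["tiktoken", "cohere", "openai", "langchain", "chromadb", "unstructured"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as "not installed" would need its install command re-run before the later cells are executed.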

# Required Functions

This code provides various utility functions for interacting with the OpenAI GPT model in a chat-based application. It includes functionalities such as formatting and printing chat completion messages, calculating daily usage costs of the OpenAI API, token counting, and data processing.

**Functions:**

1. **print_plain_text(chat_completion_message, line_width=120):**

* Formats and prints chat completion messages, considering bold, italic, bullet points, and numbered points.
* Returns the clean text without HTML tags.

2. **calculate_daily_usage(api_key):**

* Estimates the used balance in dollars for the current day from the OpenAI API usage data, assuming a flat rate of $0.002 per 1,000 tokens.
* Requires an API key for authentication.

3. **count_tokens_generic(input_string):**

* Estimates the number of tokens in a given input string using the NLTK word tokenizer (word-level tokens, which can differ from the model's own token count).

4. **trim_tokens_nltk(input_string, max_tokens=10000):**

* Trims the input string to have a maximum specified number of tokens using the NLTK word tokenizer.
* Provides an estimated number of tokens after trimming.

5. **load_data(filename):**

* Loads data from a text file specified by the filename.

6. **upload_file():**

* Allows users to upload a dataset file and choose to consider the entire dataset or a specific column for analysis.
* Supports .txt (or .text), .xlsx, and .csv file formats.
* Provides an estimated number of tokens in the uploaded data.


**Usage:**
* Users can interactively upload datasets, choose analysis options, and prompt the OpenAI model for responses.
* The code facilitates user-friendly formatting of chat completion messages and tracking daily usage costs.
* This set of utilities enhances the efficiency of utilizing OpenAI's GPT model in a chat-based application.
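To make the formatting step concrete, here is a small sketch of the Markdown-to-terminal conversion that print_plain_text is described as performing, applied to a single line. The helper name `format_markdown_line` is hypothetical; it assumes a terminal or notebook output area that renders ANSI escape codes.

```python
import re

# ANSI escape codes for bold, italic, and reset
BOLD, ITALIC, RESET = "\033[1m", "\033[3m", "\033[0m"


def format_markdown_line(line):
    """Convert **bold**, *italic*, '- ' bullets, and '1. ' numbering to ANSI styling."""
    # Bold must be handled before italic, since both use asterisks
    line = re.sub(r"\*\*(.*?)\*\*", lambda m: f"{BOLD}{m.group(1)}{RESET}", line)
    line = re.sub(r"\*(.*?)\*", lambda m: f"{ITALIC}{m.group(1)}{RESET}", line)
    if line.startswith("- "):
        # Replace the dash with a bold bullet character
        line = f"{BOLD}•{RESET} {line[2:]}"
    else:
        # Make the number of a numbered point bold
        line = re.sub(r"^(\d+)\. ", lambda m: f"{BOLD}{m.group(1)}.{RESET} ", line)
    return line


print(format_markdown_line("**Pricing** is mentioned in *many* reviews."))
print(format_markdown_line("- Long wait times"))
print(format_markdown_line("1. Improve scheduling"))
```

The sample sentences are made up for illustration only.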

In [None]:
import requests
import datetime
import openai
import nltk
import re
import textwrap
import os
import math
import string
import pandas as pd

from google.colab import files
from openai import OpenAI
from decimal import Decimal
from io import open

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
def print_plain_text(chat_completion_message, line_width=120):
    # Split the content into paragraphs
    paragraphs = chat_completion_message.content.split('\n\n')

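    for paragraph in paragraphs:
        # Wrap each source line separately so list items stay on their own lines
        # (textwrap.wrap would otherwise turn single newlines into spaces)
        wrapped_lines = []
        for source_line in paragraph.split('\n'):
            wrapped_lines.extend(textwrap.wrap(source_line, width=line_width))

        # Process and print each wrapped line
        for line in wrapped_lines: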
            # Check for bold and italic formatting
            formatted_line = re.sub(r'\*\*(.*?)\*\*', r'\033[1m\1\033[0m', line)  # Bold
            formatted_line = re.sub(r'\*(.*?)\*', r'\033[3m\1\033[0m', formatted_line)  # Italic

            # Check for bullet points
            if formatted_line.strip().startswith("- "):
                formatted_line = re.sub(r'^- (.*)$', r'\033[1m•\033[0m \1', formatted_line)

            # Check for numbered points
            elif re.match(r'^\d+\. ', formatted_line):
                formatted_line = re.sub(r'^(\d+)\. (.*)$', r'\033[1m\1.\033[0m \2', formatted_line)

            # Print the formatted line
            print(formatted_line)

    clean_text = re.sub("<[^>]*>", "", chat_completion_message.content).strip()
    return clean_text


def calculate_daily_usage(api_key):
    # API headers
    headers = {'Authorization': f'Bearer {api_key}'}

    # API endpoint
    url = 'https://api.openai.com/v1/usage'

    # Get the current date
    current_date = datetime.date.today()

    # Parameters for API request
    params = {'date': current_date.strftime('%Y-%m-%d')}

    # Send API request and get response
    response = requests.get(url, headers=headers, params=params)

    if response.status_code == 200:
        usage_data = response.json().get('data', [])

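        # Sum prompt (context) and completion (generated) tokens across all records
        total_tokens = 0
        for data in usage_data:
            total_tokens += data['n_generated_tokens_total'] + data['n_context_tokens_total']

        # NOTE: this applies a flat rate of $0.002 per 1,000 tokens to all usage.
        # Actual pricing depends on the model, so treat the result as an estimate.
        used_balance_in_dollars = (total_tokens * 0.002) / 1000

        return used_balance_in_dollars
    else:
        # Use response.text here: an error body is not guaranteed to be valid JSON
        print(f"Error: {response.status_code}, {response.text}")
        return None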


def count_tokens_generic(input_string):
    # Download the Punkt tokenizer models if not already downloaded
    # nltk.download('punkt')

    # Tokenize the input string using the nltk word tokenizer
    tokens = nltk.word_tokenize(input_string)

    return len(tokens)


def trim_tokens_nltk(input_string, max_tokens=10000):
    # Tokenize the input string using the nltk word tokenizer
    tokens = nltk.word_tokenize(input_string)

    if len(tokens) > max_tokens:
        # Truncate the list of tokens to the maximum number of tokens
        trimmed_tokens = tokens[:max_tokens]
        # Join the tokens back into a string
        input_string = ' '.join(trimmed_tokens)

        estimated_tokens = count_tokens_generic(input_string)
        print(f"\nEstimated number of tokens after trimming: {estimated_tokens}")

    return input_string


def load_data(filename):
    with open(filename, 'r') as f:
        dataset = f.read()
    return dataset

def upload_file():
    print('Please select "Choose File" to select a dataset from your hard disk for upload and analysis: \n')
    uploaded = files.upload()
    if not uploaded:
        print("File upload canceled.")
        return None

    file_name = next(iter(uploaded))
    print(f"Uploaded file: {file_name}")

    if file_name.lower().endswith(('.txt', '.text')):
        # If it's a text file, load it directly
        return load_data(file_name)
    elif file_name.lower().endswith(('.xlsx', '.csv')):
        # If it's an Excel or CSV file, convert to text
        df = pd.read_excel(file_name) if file_name.lower().endswith('.xlsx') else pd.read_csv(file_name)
        print('\n----------------------------------------------------------------------------------------------------------------\n')
        print('Do you want to consider a single column of the dataset?')
        print('Type "yes" if you want to choose a single column, and type "no" if you wish to use the entire dataset:')
        trim_data_bool = input()

        print('\n----------------------------------------------------------------------------------------------------------------\n')
        if(trim_data_bool == 'yes'):
            print('Please type the title of the column you wish to consider for analysis:')
            single_feature = input()
            if(single_feature in df.columns):
                df = df[single_feature]
                print(f"\nDataset uploaded and column '{single_feature}' is selected for analysis.")
            else:
                print(f"\n'{single_feature}' is not found in the column titles of the dataset. \nTherefore, entire dataset uploaded and ready for analysis.")
        else:
            print("Entire dataset uploaded and ready for analysis.")

        #print('\n----------------------------------------------------------------------------------------------------------------\n')
        text_data = df.to_string(index=False)

        estimated_tokens = count_tokens_generic(text_data)
        print(f"\nEstimated number of tokens: {estimated_tokens}")

        return text_data
    else:
        print("Unsupported file format. Please upload a .txt, .xlsx, or .csv file.")
        return None
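The column-selection step in upload_file depends on Colab's upload widget and interactive input, which makes it hard to try in isolation. The sketch below shows the same idea non-interactively, using the standard-library csv module in place of pandas; `table_to_text` and the sample data are hypothetical, and its output format is simpler than that of `DataFrame.to_string`.

```python
import csv
import io


def table_to_text(csv_text, column=None):
    """Return CSV content as plain text, optionally restricted to one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fieldnames = reader.fieldnames or []
    if column is not None and column in fieldnames:
        # Keep only the requested column, one value per line
        return "\n".join(row[column] for row in rows)
    # Unknown or missing column: fall back to the full table, as upload_file does
    return "\n".join(" ".join(row[name] for name in fieldnames) for row in rows)


# Made-up sample data for illustration
sample = "rating,review-text\n5,Great service\n2,Arrived late\n"
print(table_to_text(sample, column="review-text"))
print(table_to_text(sample, column="no-such-column"))
```

Restricting the text to a single column is the main lever for keeping the dataset under the token limit applied later.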

The code below defines a main loop function for an interactive chat-based application using the OpenAI GPT-4 model. The loop allows users to prompt the model with questions related to a provided dataset, analyze market dynamics, and store the conversation history. Here's a breakdown of the key features:

**Features:**

1. **User Guide:**

* Provides a user-friendly guide with instructions on how to interact with the chat application.
* Informs users about commands to exit, quit, or delete chat history.

2. **Chat History Handling:**

* Checks for the existence of a chat history file.
* Asks the user if they want to delete existing chat history.
* Deletes or loads chat history accordingly.

3. **Interactive Loop:**

* Prompts the user to input questions or prompts.
* Handles exit commands to stop the chat loop.

4. **System and User Roles:**

* Defines roles for the system (market analyzer) and the user.
* Incorporates provided dataset and chat history into the system role.
* Generates a user role based on the user's prompt.

5. **OpenAI GPT-4 Interaction:**

* Uses the OpenAI GPT-4 model (gpt-4-1106-preview in the code) to generate responses.
* Employs the Chat API for chat-based completions.

6. **Output Formatting:**

* Prints the analyzer's output in a formatted manner using the print_plain_text function.

7. **Storage of Prompts and Outputs:**

* Stores user prompts and model-generated outputs in a dictionary (prompts_and_outputs).
* Appends prompt-response pairs to the chat history file.

**Usage:**
* Users can interactively provide prompts and receive responses from the GPT-4 model.
* The chat history is maintained and can be deleted or loaded based on user preference.
* Prompts and responses are stored for future reference.

This main loop function enhances the user experience in querying and analyzing a dataset using the OpenAI GPT-4 model within a chat-based application.
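The chat history handling described above can be sketched without the interactive prompts. This is a minimal illustration under the assumption that the history is a plain text file of "Prompt:/Answer:" pairs; the helper names `load_or_reset_history` and `append_exchange` are hypothetical.

```python
import os
import tempfile


def load_or_reset_history(path, delete_existing):
    """Return prior history text, clearing or creating the file when needed."""
    if os.path.exists(path) and not delete_existing:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    # Either there is no file yet or the user asked to start fresh
    open(path, "w", encoding="utf-8").close()
    return ""


def append_exchange(path, prompt, answer):
    """Append one prompt-response pair to the history file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"Prompt: {prompt}\nAnswer: {answer}\n")


# Demonstrate in a temporary directory so no real history file is touched
with tempfile.TemporaryDirectory() as tmp:
    history_path = os.path.join(tmp, "chat_history.txt")
    load_or_reset_history(history_path, delete_existing=False)  # creates the file
    append_exchange(history_path, "what is the data about?", "Customer reviews.")
    print(load_or_reset_history(history_path, delete_existing=False))
```

Because the history is stored as plain text and passed back to the model in full, it grows with every exchange and counts toward the context window.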

In [None]:
# Function for the main loop
def main_loop(api_key, chat_history_file, dataset):
    client = OpenAI(api_key=api_key)
    prompts_and_outputs = {}

    print('\n\033[1m------------------------------------------------------------------------------------------------------------------------\033[0m \n')
    print("\033[1mUser Guide\033[0m")
    print('1) Please type your prompt to ask questions about the dataset.')
    print('2) You can use "q", "quit", or "exit" commands to stop the code from asking for prompts.')
    print('3) If asked for deleting the previous chat history or not, typing "yes" will delete it and then the model')
    print('   will not consider previous chat data in its answers. Otherwise, type "no" for using the previous chat data.')
    print('   (These words are case-sensitive)')
    print('\n\033[1m------------------------------------------------------------------------------------------------------------------------\033[0m \n')

    # Check if the file exists
    if os.path.exists(chat_history_file):
        # Ask the user if they want to delete the existing data
        user_input = input("Previous chat history exists. Do you want to delete it? (yes/no): ").lower()

        if user_input == 'yes':
            # Delete existing data
            open(chat_history_file, 'w').close()
            print("Existing chat history deleted.")
            chat_history_string = ""
        else:
            # Load existing data
            with open(chat_history_file, 'r') as f:
                chat_history_string = ''.join(f.readlines())
            print("Existing chat history loaded.")
    else:
        # Handle the case where the file is not found
        open(chat_history_file, 'w').close()
        chat_history_string = ""
        print(f'{chat_history_file} created to save the conversation history.')

    print('\n------------------------------------------------------------------------------------------------------------------------\n')

    while True:
        with open(chat_history_file, 'r') as f:
            # Read all lines and join them into a single string
            chat_history_string = ''.join(f.readlines())

        user_prompt = input('Prompt: ')
        print('\n------------------------------------------------------------------------------------------------------------------------\n')

        if user_prompt in ('exit', 'q', 'quit'):
            # Report today's estimated spend using the API key passed to main_loop
            daily_usage = calculate_daily_usage(api_key)

            if daily_usage is not None:
                print(f"Used balance in dollars for today: ${daily_usage:.6f}")

            break

        system_role = f"You are a market analyzer that reviews all of the text provided to you, which may include customer feedback and reviews, sales data, price data, etc. from a business. You should analyze that business, its strengths and weaknesses, and the market, including its gaps and needs. Then, keeping that data in memory, wait for the user prompt and analyze the business in the provided data according to that prompt. \nHere is the provided data: \n{dataset}. Also, here is the chat history so far: {chat_history_string}."

        user_role = f"Here is the prompt: \n{user_prompt}."

        completion = client.chat.completions.create(
            #model="gpt-3.5-turbo",
            model="gpt-4-1106-preview",
            messages=[
                {"role": "system", "content": system_role},
                {"role": "user", "content": user_role}
            ]
        )

        print('\033[1mAnalyzers Output:\033[0m \n')
        prompt_output = print_plain_text(completion.choices[0].message)
        print('\n------------------------------------------------------------------------------------------------------------------------\n')

        # Store prompt and output
        prompts_and_outputs[user_prompt] = prompt_output

        # Append prompt and response to chat history
        with open(chat_history_file, 'a') as f:
            f.write(f'Prompt: {user_prompt}\nAnswer: {prompt_output}\n')
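The way main_loop assembles its request can be looked at separately from the API call. The sketch below builds the same two-message structure (system and user) from a dataset, the history text, and a prompt; `build_messages` is a hypothetical helper and the system text is shortened for readability.

```python
def build_messages(dataset, chat_history, user_prompt):
    """Build the system and user messages sent with each request."""
    system_content = (
        "You are a market analyzer. Review the provided business data and "
        "answer the user's prompt.\n"
        f"Here is the provided data:\n{dataset}\n"
        f"Here is the chat history so far:\n{chat_history}"
    )
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": f"Here is the prompt:\n{user_prompt}"},
    ]


# Made-up values for illustration
history = ""
first = build_messages("review 1\nreview 2", history, "what is the data about?")
history += "Prompt: what is the data about?\nAnswer: Customer reviews.\n"
second = build_messages("review 1\nreview 2", history, "what are the main complaints?")

# The system message grows as the history accumulates
print(len(first[0]["content"]), len(second[0]["content"]))
```

Since the full dataset and history are resent on every turn, each request costs at least as many prompt tokens as the previous one.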


This cell calls calculate_daily_usage with your API key and prints an estimate of the current day's OpenAI API spend in dollars. The estimate applies a flat rate of $0.002 per 1,000 tokens to all usage, so it may not match actual GPT-4 billing:

In [None]:
# Provide your API key here
api_key = ""

# Call the function to calculate daily usage
daily_usage = calculate_daily_usage(api_key)

if daily_usage is not None:
    print(f"Used balance in dollars for today: ${daily_usage:.6f}")

Used balance in dollars for today: $0.098406
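The arithmetic behind this figure can be checked by hand. At the flat rate of $0.002 per 1,000 tokens used in calculate_daily_usage, $0.098406 corresponds to 49,203 tokens. The sketch below reproduces that calculation; the helper name and the split of tokens across the two usage records are made up for illustration, and only the field names come from the function above.

```python
def estimate_cost_usd(total_tokens, usd_per_1k_tokens=0.002):
    """Estimate cost from a token count at a flat per-1,000-token rate."""
    return total_tokens * usd_per_1k_tokens / 1000


# Hypothetical usage records shaped like the fields calculate_daily_usage reads
usage_records = [
    {"n_context_tokens_total": 40000, "n_generated_tokens_total": 5000},
    {"n_context_tokens_total": 3500, "n_generated_tokens_total": 703},
]

total = sum(
    r["n_context_tokens_total"] + r["n_generated_tokens_total"]
    for r in usage_records
)
print(total)                               # 49203
print(f"${estimate_cost_usd(total):.6f}")  # $0.098406
```

Passing a different `usd_per_1k_tokens` value is one way to adapt the estimate to a model with different pricing.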


# API Code

This cell uploads a dataset file and then trims it to a maximum number of tokens. upload_file accepts .txt, .xlsx, and .csv files and converts them to text, and trim_tokens_nltk truncates that text to 10,000 NLTK word tokens by default so that it is more likely to fit within the model's context window. NLTK word tokens are only an approximation of the tokens the model counts:

In [None]:
# Use the function to upload and process the file
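dataset = upload_file()
if dataset is not None:
    # Trim only when the upload succeeded; upload_file() returns None otherwise
    dataset = trim_tokens_nltk(dataset)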

Please select "Choose File" to select a dataset from your hard disk for upload and analysis: 



Saving test_yourmechanic_dataset.xlsx to test_yourmechanic_dataset.xlsx
Uploaded file: test_yourmechanic_dataset.xlsx

----------------------------------------------------------------------------------------------------------------

Do you want to consider a single column of the dataset?
Type "yes" if you want to choose a single column, and type "no" if you wish to use the entire dataset:
yes

----------------------------------------------------------------------------------------------------------------

Please type the title of the column you wish to consider for analysis:
review-text

Dataset uploaded and column 'review-text' is selected for analysis.

Estimated number of tokens: 9796
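The trimming step can be illustrated on a small string. The sketch below uses a simple regular expression as a rough stand-in for `nltk.word_tokenize`, so its counts will not match NLTK's exactly; `simple_tokenize` and `trim_to_budget` are hypothetical names.

```python
import re


def simple_tokenize(text):
    """Rough stand-in for nltk.word_tokenize: split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def trim_to_budget(text, max_tokens):
    """Return text unchanged if within budget, else its first max_tokens tokens."""
    tokens = simple_tokenize(text)
    if len(tokens) <= max_tokens:
        return text
    # Rejoining with spaces changes the spacing around punctuation
    return " ".join(tokens[:max_tokens])


review = "Great service, but the mechanic arrived late."
print(len(simple_tokenize(review)))  # 9
print(trim_to_budget(review, 4))     # Great service , but
```

As with trim_tokens_nltk, truncation keeps the start of the text and drops the rest, so rows near the end of a large dataset are the first to be lost. Word-level counts also differ from the model's own tokenization; tiktoken, installed earlier, could be used where a closer estimate is needed.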


This code initiates the main loop for a chat analysis system. It involves providing the OpenAI API key, specifying the chat history file path, and calling the main_loop function. Within this loop, users interact by inputting prompts related to a dataset, and the system generates responses for analysis. Additionally, it manages the chat history, allowing users to decide whether to delete previous chat data for each session:

In [None]:
api_key = '' # Provide your API key here
chat_history_file = "chat_history.txt"  # Replace with the actual file path

# Call the main loop function
main_loop(api_key, chat_history_file, dataset)


------------------------------------------------------------------------------------------------------------------------

User Guide
1) Please type your prompt to ask questions about the dataset.
2) You can use "q", "quit", or "exit" commands to stop the code from asking for prompts.
3) If asked for deleting the previous chat history or not, typing "yes" will delete it and then the model
   will not consider previous chat data in its answers. Otherwise, type "no" for using the previous chat data.
   (These words are case-sensitive)

------------------------------------------------------------------------------------------------------------------------

chat_history.txt created to save the conversation history.

------------------------------------------------------------------------------------------------------------------------

Prompt: what is the data about?

-------------------------------------------------------------------------------------------------