## Microsoft Office File Readers: PowerPoint, Excel, and Word

**Important Note:**
For Unstructured data reader setup, see:

https://python.langchain.com/docs/integrations/providers/unstructured/

### Project 1: Key Notes and Script Generation for PPT Presenter (Speaker)

```bash
OSError: No such file or directory: 'C:\Users\laxmi\AppData\Roaming\nltk_data\tokenizers\punkt\PY3_tab'
```

Error handling: in `C:\Users\laxmi\AppData\Roaming\nltk_data\tokenizers\punkt`, rename the `PY3` folder to `PY3_tab`.
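Before loading any document, it can help to confirm which of the two folders is actually present. The sketch below is a minimal, standard-library-only check; the helper names `punkt_layout` and `py3_tab_missing` are hypothetical, and the code only inspects the file system without calling NLTK. It runs against a temporary directory so that it works on any machine.

```python
import tempfile
from pathlib import Path


def punkt_layout(punkt_dir: str) -> dict:
    """Report which of the two punkt sub-folders exist under `punkt_dir`.

    Hypothetical helper for illustration only.
    """
    base = Path(punkt_dir)
    return {name: (base / name).is_dir() for name in ("PY3", "PY3_tab")}


def py3_tab_missing(punkt_dir: str) -> bool:
    """True when `PY3` exists but `PY3_tab` does not (the case in the error above)."""
    layout = punkt_layout(punkt_dir)
    return layout["PY3"] and not layout["PY3_tab"]


# Demonstration on a throwaway directory that mimics the broken layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "PY3").mkdir()
    print(punkt_layout(tmp))     # {'PY3': True, 'PY3_tab': False}
    print(py3_tab_missing(tmp))  # True
```

On a real machine, pass the `punkt` path from the traceback instead of the temporary directory. Recent NLTK releases also appear to distribute this data as a separate `punkt_tab` resource, so `nltk.download('punkt_tab')` may be worth trying; treat that as an assumption to verify against the installed NLTK version.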



In [2]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# !pip install unstructured openpyxl python-magic python-pptx
# !pip install "unstructured[all-docs]"

In [2]:
from langchain_community.document_loaders import UnstructuredPowerPointLoader
from langchain.schema import Document

In [3]:
loader = UnstructuredPowerPointLoader(file_path="data/ml_course.pptx", mode='elements')

docs = loader.load()

In [4]:
docs

[Document(metadata={'source': 'data/ml_course.pptx', 'category_depth': 0, 'file_directory': 'data', 'filename': 'ml_course.pptx', 'last_modified': '2024-11-03T21:48:01', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title', 'element_id': '1378fc8ea4dd0830a3c5f93e9d31a516'}, page_content='Machine Learning Model Deployment'),
 Document(metadata={'source': 'data/ml_course.pptx', 'category_depth': 1, 'file_directory': 'data', 'filename': 'ml_course.pptx', 'last_modified': '2024-11-03T21:48:01', 'page_number': 1, 'languages': ['eng'], 'parent_id': '1378fc8ea4dd0830a3c5f93e9d31a516', 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title', 'element_id': '5de650940883c293416c7f769aa0c961'}, page_content='Introduction to ML Pipeline'),
 Document(metadata={'source': 'data/ml_course.pptx', 'category_depth': 1, 'file_directory': 'data', 'filename': '

In [5]:
docs[0].page_content

'Machine Learning Model Deployment'

In [6]:
docs[0].metadata

{'source': 'data/ml_course.pptx',
 'category_depth': 0,
 'file_directory': 'data',
 'filename': 'ml_course.pptx',
 'last_modified': '2024-11-03T21:48:01',
 'page_number': 1,
 'languages': ['eng'],
 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
 'category': 'Title',
 'element_id': '1378fc8ea4dd0830a3c5f93e9d31a516'}

In [7]:
docs[7].page_content

'Batch: In batch deployment, ML models process large volumes of data at scheduled intervals, ideal for tasks like end-of-day reporting or monthly analytics.'

In [8]:
docs[7].metadata

{'source': 'data/ml_course.pptx',
 'category_depth': 0,
 'file_directory': 'data',
 'filename': 'ml_course.pptx',
 'last_modified': '2024-11-03T21:48:01',
 'page_number': 3,
 'languages': ['eng'],
 'parent_id': 'e16acab18bc7971064f2d06df301b796',
 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
 'category': 'ListItem',
 'element_id': '36a2a2dc69a5a623ccc3eaa559f874eb'}

In [9]:
len(docs)

47
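In `elements` mode each element carries a `category` and a `category_depth` in its metadata, as the two metadata dumps above show. A minimal sketch of how those fields could be used to tally element types and pick out top-level slide titles follows. Plain dicts stand in for `Document` objects so the example runs without the loader; the three records copy values shown above, and the function names are hypothetical.

```python
from collections import Counter

# Plain dicts stand in for Document objects; values are copied from the outputs above.
elements = [
    {"page_content": "Machine Learning Model Deployment",
     "metadata": {"page_number": 1, "category": "Title", "category_depth": 0}},
    {"page_content": "Introduction to ML Pipeline",
     "metadata": {"page_number": 1, "category": "Title", "category_depth": 1}},
    {"page_content": "Batch: In batch deployment, ML models process large volumes of data "
                     "at scheduled intervals, ideal for tasks like end-of-day reporting "
                     "or monthly analytics.",
     "metadata": {"page_number": 3, "category": "ListItem", "category_depth": 0}},
]


def count_categories(items):
    """Count how many elements fall into each category."""
    return Counter(item["metadata"].get("category", "Unknown") for item in items)


def top_level_titles(items):
    """Map page number -> first depth-0 Title element found on that page."""
    titles = {}
    for item in items:
        meta = item["metadata"]
        if meta.get("category") == "Title" and meta.get("category_depth") == 0:
            titles.setdefault(meta.get("page_number"), item["page_content"])
    return titles


print(count_categories(elements))  # Counter({'Title': 2, 'ListItem': 1})
print(top_level_titles(elements))  # {1: 'Machine Learning Model Deployment'}
```

With real loader output, the same logic would read `doc.metadata` and `doc.page_content` instead of dictionary keys.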

In [10]:
ppt_data = {}

for doc in docs:
    if isinstance(doc,Document):
        page_number = doc.metadata.get('page_number',None)
        if page_number:
            ppt_data[page_number] = ppt_data.get(page_number,'') + '\n' + doc.page_content

In [11]:
ppt_data

{1: '\nMachine Learning Model Deployment\nIntroduction to ML Pipeline\nhttps://bit.ly/bert_nlp',
 2: '\nWhat is Machine Learning Pipeline?',
 3: '\nType of ML Deployment\nBatch: In batch deployment, ML models process large volumes of data at scheduled intervals, ideal for tasks like end-of-day reporting or monthly analytics.\nStream: Stream deployment enables ML models to process and analyze data in real-time as it flows in, suitable for applications like fraud detection or live social media analysis.\nRealtime: Realtime deployment allows ML models to provide instant predictions or decisions in response to incoming data, essential for use cases like recommendation systems or autonomous driving.\nEdge: Edge deployment involves running ML models on local devices close to the data source, reducing latency and bandwidth usage, which is crucial for IoT applications and smart devices.',
 4: '\nInfrastructure and Integration\nHardware and Software: Setting up the right environment for model d

In [12]:
print(ppt_data[3])


Type of ML Deployment
Batch: In batch deployment, ML models process large volumes of data at scheduled intervals, ideal for tasks like end-of-day reporting or monthly analytics.
Stream: Stream deployment enables ML models to process and analyze data in real-time as it flows in, suitable for applications like fraud detection or live social media analysis.
Realtime: Realtime deployment allows ML models to provide instant predictions or decisions in response to incoming data, essential for use cases like recommendation systems or autonomous driving.
Edge: Edge deployment involves running ML models on local devices close to the data source, reducing latency and bandwidth usage, which is crucial for IoT applications and smart devices.
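Each page string above starts with a newline because the loop prepends `'\n'` to every element, including the first one on a page. One possible variant, sketched below, collects the lines of each page in a list and joins them once at the end. `SimpleDoc` is a hypothetical stand-in for LangChain's `Document` so the example runs on its own.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_page(docs):
    """Return {page_number: text}, skipping elements without a page number."""
    pages = defaultdict(list)
    for doc in docs:
        page_number = doc.metadata.get("page_number")
        if page_number is not None:
            pages[page_number].append(doc.page_content)
    # Join once per page, so no leading newline is produced.
    return {page: "\n".join(lines) for page, lines in pages.items()}


sample = [
    SimpleDoc("Machine Learning Model Deployment", {"page_number": 1}),
    SimpleDoc("Introduction to ML Pipeline", {"page_number": 1}),
    SimpleDoc("", {}),  # an element without a page number
    SimpleDoc("What is Machine Learning Pipeline?", {"page_number": 2}),
]
print(group_by_page(sample))
```

The trade-off is cosmetic: the original loop is shorter, while this version avoids the need to strip the text later.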


In [13]:
def extract_ppt_data(file_path: str, mode: str = 'elements', verbose: bool = False) -> dict:
    """
    Extracts content from a PowerPoint file and organizes it by page number.

    Args:
        file_path (str): Path to the PowerPoint file.
        mode (str): Mode for loading the PowerPoint file. Default is 'elements'.
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        dict: A dictionary where keys are page numbers and values are the concatenated content of each page.

    Raises:
        FileNotFoundError: If the file path is invalid.
        ValueError: If the loader fails to process the file.
    """
    if verbose:
        print(f"Initializing UnstructuredPowerPointLoader with file: {file_path} and mode: {mode}...")

    try:
        loader = UnstructuredPowerPointLoader(file_path=file_path, mode=mode)
        docs = loader.load()
        if verbose:
            print(f"Successfully loaded {len(docs)} documents.")
    except FileNotFoundError:
        raise FileNotFoundError(f"The file '{file_path}' does not exist.")
    except Exception as e:
        raise ValueError(f"Failed to load PowerPoint file. Error: {str(e)}")

    ppt_data = {}

    for idx, doc in enumerate(docs, start=1):
        if isinstance(doc, Document):
            page_number = doc.metadata.get('page_number', None)
            if page_number:
                ppt_data[page_number] = ppt_data.get(page_number, '') + '\n' + doc.page_content
            if verbose:
                print(f"Processed document {idx}/{len(docs)}: Page {page_number if page_number else 'break'}")

    if verbose:
        print("Extraction complete.")

    return ppt_data

In [14]:
ppt_data = extract_ppt_data(file_path="data/ml_course.pptx", mode='elements', verbose=True)

Initializing UnstructuredPowerPointLoader with file: data/ml_course.pptx and mode: elements...
Successfully loaded 47 documents.
Processed document 1/47: Page 1
Processed document 2/47: Page 1
Processed document 3/47: Page 1
Processed document 4/47: Page break
Processed document 5/47: Page 2
Processed document 6/47: Page break
Processed document 7/47: Page 3
Processed document 8/47: Page 3
Processed document 9/47: Page 3
Processed document 10/47: Page 3
Processed document 11/47: Page 3
Processed document 12/47: Page break
Processed document 13/47: Page 4
Processed document 14/47: Page 4
Processed document 15/47: Page 4
Processed document 16/47: Page break
Processed document 17/47: Page 5
Processed document 18/47: Page 5
Processed document 19/47: Page break
Processed document 20/47: Page 6
Processed document 21/47: Page 6
Processed document 22/47: Page 6
Processed document 23/47: Page 6
Processed document 24/47: Page 6
Processed document 25/47: Page 6
Processed document 26/47: Page 6
Pr

In [15]:
ppt_data

{1: '\nMachine Learning Model Deployment\nIntroduction to ML Pipeline\nhttps://bit.ly/bert_nlp',
 2: '\nWhat is Machine Learning Pipeline?',
 3: '\nType of ML Deployment\nBatch: In batch deployment, ML models process large volumes of data at scheduled intervals, ideal for tasks like end-of-day reporting or monthly analytics.\nStream: Stream deployment enables ML models to process and analyze data in real-time as it flows in, suitable for applications like fraud detection or live social media analysis.\nRealtime: Realtime deployment allows ML models to provide instant predictions or decisions in response to incoming data, essential for use cases like recommendation systems or autonomous driving.\nEdge: Edge deployment involves running ML models on local devices close to the data source, reducing latency and bandwidth usage, which is crucial for IoT applications and smart devices.',
 4: '\nInfrastructure and Integration\nHardware and Software: Setting up the right environment for model d

In [16]:
context = ""

for page_number, page_content in ppt_data.items():
    context += f"### Page-{page_number}{page_content}\n\n"

In [17]:
print(context)

### Page-1
Machine Learning Model Deployment
Introduction to ML Pipeline
https://bit.ly/bert_nlp

### Page-2
What is Machine Learning Pipeline?

### Page-3
Type of ML Deployment
Batch: In batch deployment, ML models process large volumes of data at scheduled intervals, ideal for tasks like end-of-day reporting or monthly analytics.
Stream: Stream deployment enables ML models to process and analyze data in real-time as it flows in, suitable for applications like fraud detection or live social media analysis.
Realtime: Realtime deployment allows ML models to provide instant predictions or decisions in response to incoming data, essential for use cases like recommendation systems or autonomous driving.
Edge: Edge deployment involves running ML models on local devices close to the data source, reducing latency and bandwidth usage, which is crucial for IoT applications and smart devices.

### Page-4
Infrastructure and Integration
Hardware and Software: Setting up the right environment for m

In [18]:
def build_context_from_ppt_data(ppt_data: dict, verbose: bool = False) -> str:
    """
    Builds a formatted context string from PowerPoint data organized by page number.

    Args:
        ppt_data (dict): A dictionary where keys are page numbers (int) and 
                         values are the corresponding page content (str).
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        str: A formatted context string with page information.

    Raises:
        ValueError: If `ppt_data` is empty or not a dictionary.
    """
    if not isinstance(ppt_data, dict):
        raise ValueError("Invalid input: `ppt_data` must be a dictionary.")
    if not ppt_data:
        raise ValueError("Invalid input: `ppt_data` cannot be empty.")

    if verbose:
        print("Building context from PowerPoint data...")

    context = ""

    for page_number, page_content in sorted(ppt_data.items()):
        if not isinstance(page_number, int):
            if verbose:
                print(f"Skipping invalid page number: {page_number}")
            continue
        if not isinstance(page_content, str):
            if verbose:
                print(f"Skipping invalid page content for page {page_number}")
            continue

        if verbose:
            print(f"Adding content for Page-{page_number}...")

        context += f"### Page-{page_number}\n{page_content.strip()}\n\n"

    if not context:
        raise ValueError("Context generation failed: No valid content found in `ppt_data`.")

    if verbose:
        print("Context generation complete.")
    
    return context


In [19]:
context = build_context_from_ppt_data(ppt_data=ppt_data, verbose=True)

Building context from PowerPoint data...
Adding content for Page-1...
Adding content for Page-2...
Adding content for Page-3...
Adding content for Page-4...
Adding content for Page-5...
Adding content for Page-6...
Adding content for Page-7...
Adding content for Page-8...
Adding content for Page-9...
Context generation complete.


In [20]:
print(context)

### Page-1
Machine Learning Model Deployment
Introduction to ML Pipeline
https://bit.ly/bert_nlp

### Page-2
What is Machine Learning Pipeline?

### Page-3
Type of ML Deployment
Batch: In batch deployment, ML models process large volumes of data at scheduled intervals, ideal for tasks like end-of-day reporting or monthly analytics.
Stream: Stream deployment enables ML models to process and analyze data in real-time as it flows in, suitable for applications like fraud detection or live social media analysis.
Realtime: Realtime deployment allows ML models to provide instant predictions or decisions in response to incoming data, essential for use cases like recommendation systems or autonomous driving.
Edge: Edge deployment involves running ML models on local devices close to the data source, reducing latency and bandwidth usage, which is crucial for IoT applications and smart devices.

### Page-4
Infrastructure and Integration
Hardware and Software: Setting up the right environment for m

In [1]:
from scripts import ask_llm
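`ask_llm` comes from a local `scripts` module whose code is not shown in this notebook. The sketch below is only a guess at its general shape: it assumes the helper merges the context and the question into one prompt and hands that prompt to a model client. The prompt wording, `build_prompt`, `ask_llm_sketch`, and the extra `llm` parameter are all assumptions made for illustration; the real helper is called with just `context` and `question`.

```python
from typing import Callable


def build_prompt(context: str, question: str) -> str:
    """Assumed prompt layout: context first, then the question."""
    return (
        "Use the context below to answer the question.\n\n"
        f"### Context\n{context.strip()}\n\n"
        f"### Question\n{question.strip()}\n"
    )


def ask_llm_sketch(context: str, question: str, llm: Callable[[str], str]) -> str:
    """Hypothetical stand-in for scripts.ask_llm; `llm` is any prompt -> text callable."""
    if not question.strip():
        raise ValueError("question must be a non-empty string")
    return llm(build_prompt(context, question))


def fake_llm(prompt: str) -> str:
    """Offline placeholder model that only reports the prompt size."""
    return f"received {len(prompt)} characters"


print(ask_llm_sketch("### Page-1\nMachine Learning Model Deployment",
                     "Summarise the slide.",
                     fake_llm))
```

Passing the model as a callable keeps the sketch testable without network access; a real implementation would wrap whichever client the `scripts` module uses.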

In [21]:
question ="""
For each PowerPoint slide provided above, write a 2-minute script that effectively conveys the key points.
Ensure a smooth flow between slides, maintaining a clear and engaging narrative.
"""

response = ask_llm(context=context,
                   question=question)

print(response)

Here is a script for a 2-minute presentation based on the provided PowerPoint slides:

**Slide 1: Introduction to ML Pipeline**

"Good morning everyone, and welcome to our presentation on Machine Learning Model Deployment. As we all know, machine learning has revolutionized the way we approach complex problems in various industries. But have you ever wondered what it takes to get a machine learning model from development to production? Today, we'll take you through the journey of ML pipeline deployment, and explore the key considerations that come with it."

**Slide 2: What is Machine Learning Pipeline?**

"So, what exactly is an ML pipeline? Simply put, an ML pipeline refers to the process of building, testing, and deploying a machine learning model. It's like a manufacturing line for AI models. Just as a car goes through various stages of production, from design to assembly, an ML pipeline takes a model through different stages of development, validation, and deployment."

**Slide 3:

In [22]:
with open("data/ppt_script.md", "w", encoding="utf-8") as f:
    f.write(response)

#### Final code

In [30]:
import os

from langchain_community.document_loaders import UnstructuredPowerPointLoader
from langchain.schema import Document

from scripts import ask_llm


def extract_ppt_data(file_path: str, mode: str = 'elements', verbose: bool = False) -> dict:
    """
    Extracts content from a PowerPoint file and organizes it by page number.

    Args:
        file_path (str): Path to the PowerPoint file.
        mode (str): Mode for loading the PowerPoint file. Default is 'elements'.
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        dict: A dictionary where keys are page numbers and values are the concatenated content of each page.

    Raises:
        FileNotFoundError: If the file path is invalid.
        ValueError: If the loader fails to process the file.
    """
    if verbose:
        print(f"Initializing UnstructuredPowerPointLoader with file: {file_path} and mode: {mode}...")

    try:
        loader = UnstructuredPowerPointLoader(file_path=file_path, mode=mode)
        docs = loader.load()
        if verbose:
            print(f"Successfully loaded {len(docs)} documents.")
    except FileNotFoundError:
        raise FileNotFoundError(f"The file '{file_path}' does not exist.")
    except Exception as e:
        raise ValueError(f"Failed to load PowerPoint file. Error: {str(e)}")

    ppt_data = {}

    for idx, doc in enumerate(docs, start=1):
        if isinstance(doc, Document):
            page_number = doc.metadata.get('page_number', None)
            if page_number:
                ppt_data[page_number] = ppt_data.get(page_number, '') + '\n' + doc.page_content
            if verbose:
                print(f"Processed document {idx}/{len(docs)}: Page {page_number if page_number else 'break'}")

    if verbose:
        print("Extraction complete.")

    return ppt_data

def build_context_from_ppt_data(ppt_data: dict, verbose: bool = False) -> str:
    """
    Builds a formatted context string from PowerPoint data organized by page number.

    Args:
        ppt_data (dict): A dictionary where keys are page numbers (int) and 
                         values are the corresponding page content (str).
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        str: A formatted context string with page information.

    Raises:
        ValueError: If `ppt_data` is empty or not a dictionary.
    """
    if not isinstance(ppt_data, dict):
        raise ValueError("Invalid input: `ppt_data` must be a dictionary.")
    if not ppt_data:
        raise ValueError("Invalid input: `ppt_data` cannot be empty.")

    if verbose:
        print("Building context from PowerPoint data...")

    context = ""

    for page_number, page_content in sorted(ppt_data.items()):
        if not isinstance(page_number, int):
            if verbose:
                print(f"Skipping invalid page number: {page_number}")
            continue
        if not isinstance(page_content, str):
            if verbose:
                print(f"Skipping invalid page content for page {page_number}")
            continue

        if verbose:
            print(f"Adding content for Page-{page_number}...")

        context += f"### Page-{page_number}\n{page_content.strip()}\n\n"

    if not context:
        raise ValueError("Context generation failed: No valid content found in `ppt_data`.")

    if verbose:
        print("Context generation complete.")
    
    return context


def get_script_for_ppt(
    file_path: str,
    mode: str,
    question: str,
    save_and_display: bool = False,
    save: bool = False,
    save_file_path: str = "data/ppt_script.md",
    verbose: bool = False
) -> str:
    """
    Generates a script for a PowerPoint presentation by extracting data, building context, and querying an LLM.

    Args:
        file_path (str): Path to the PowerPoint file.
        mode (str): Mode for loading the PowerPoint file (e.g., 'elements').
        question (str): Question to query the LLM with.
        save_and_display (bool): If True, saves the response to a file and returns it. Default is False.
        save (bool): If True, saves the response to a file. Ignored if `save_and_display` is True. Default is False.
        save_file_path (str): Path to save the generated script. Default is "data/ppt_script.md".
        verbose (bool): If True, displays progress and debugging messages. Default is False.

    Returns:
        str: The generated response from the LLM. Returns None when only `save` is True.

    Raises:
        FileNotFoundError: If the PowerPoint file does not exist.
        ValueError: If an invalid mode or empty question is provided.
        Exception: For other unexpected errors.
    """
    try:
        if verbose:
            print("Starting script generation...")

        # Validate inputs
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"The specified PowerPoint file does not exist: {file_path}")
        if not isinstance(mode, str) or not mode.strip():
            raise ValueError("Invalid mode provided. Mode must be a non-empty string.")
        if not isinstance(question, str) or not question.strip():
            raise ValueError("Invalid question provided. Question must be a non-empty string.")

        if verbose:
            print(f"Extracting data from file: {file_path} using mode: {mode}...")

        # Extract PowerPoint data
        ppt_data = extract_ppt_data(file_path=file_path, mode=mode, verbose=verbose)

        if verbose:
            print("Building context from extracted data...")

        # Build context
        context = build_context_from_ppt_data(ppt_data=ppt_data, verbose=verbose)

        if verbose:
            print("Querying the LLM for response...")

        # Query LLM
        response = ask_llm(context=context, question=question)

        # Save the response if either flag requests it
        if save_and_display or save:
            save_dir = os.path.dirname(save_file_path)
            if save_dir:  # os.makedirs("") raises, so skip it for bare file names
                os.makedirs(save_dir, exist_ok=True)
            with open(save_file_path, "w", encoding="utf-8") as f:
                f.write(response)
            if verbose:
                print(f"Response saved to: {save_file_path}")

        if verbose:
            print("Script generation complete.")

        # Return the response unless the caller asked to save only
        if save_and_display or not save:
            return response

    except FileNotFoundError as fnf_error:
        raise FileNotFoundError(f"File error: {fnf_error}")
    except ValueError as val_error:
        raise ValueError(f"Validation error: {val_error}")
    except Exception as e:
        raise Exception(f"An unexpected error occurred: {str(e)}")
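The interaction between `save_and_display` and `save` in `get_script_for_ppt` is easy to misread, so here is a small sketch that spells out the four combinations. `output_policy` is a hypothetical helper that mirrors the two conditions used in the function: a file is written when either flag is set, and the text is returned unless only `save` is set.

```python
from itertools import product


def output_policy(save_and_display: bool, save: bool) -> tuple:
    """Return (writes_file, returns_text) for one flag combination."""
    writes_file = save_and_display or save
    returns_text = save_and_display or not save
    return writes_file, returns_text


# Print the full truth table for the two flags.
for save_and_display, save in product([False, True], repeat=2):
    writes_file, returns_text = output_policy(save_and_display, save)
    print(f"save_and_display={save_and_display!s:5} save={save!s:5} "
          f"-> writes_file={writes_file!s:5} returns_text={returns_text}")
```

The only combination that returns nothing is `save=True` with `save_and_display=False`, which is worth keeping in mind when assigning the result to a variable.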


In [28]:
file_path = "data/ml_course.pptx"
mode = 'elements'
question ="""For each PowerPoint slide provided above, write a 2-minute script that effectively conveys the key points.
    Ensure a smooth flow between slides, maintaining a clear and engaging narrative.
    """
save = False
save_file_path = "data/ppt_script.md"


script = get_script_for_ppt(
    file_path=file_path,
    mode=mode,
    question=question,
    save_and_display=True,
    save_file_path=save_file_path,
    verbose=True,
)

print(script)

Starting script generation...
Extracting data from file: data/ml_course.pptx using mode: elements...
Initializing UnstructuredPowerPointLoader with file: data/ml_course.pptx and mode: elements...
Successfully loaded 47 documents.
Processed document 1/47: Page 1
Processed document 2/47: Page 1
Processed document 3/47: Page 1
Processed document 4/47: Page break
Processed document 5/47: Page 2
Processed document 6/47: Page break
Processed document 7/47: Page 3
Processed document 8/47: Page 3
Processed document 9/47: Page 3
Processed document 10/47: Page 3
Processed document 11/47: Page 3
Processed document 12/47: Page break
Processed document 13/47: Page 4
Processed document 14/47: Page 4
Processed document 15/47: Page 4
Processed document 16/47: Page break
Processed document 17/47: Page 5
Processed document 18/47: Page 5
Processed document 19/47: Page break
Processed document 20/47: Page 6
Processed document 21/47: Page 6
Processed document 22/47: Page 6
Processed document 23/47: Page 6


***

### Project 2: Excel Data Analysis with LLM
**Note:** Currently, LLMs are not good at math and data analysis.


In [31]:
from langchain_community.document_loaders import UnstructuredExcelLoader

In [32]:
loader = UnstructuredExcelLoader(file_path="data/sample.xlsx", mode='elements')

docs = loader.load()

In [34]:
len(docs)

1

In [37]:
docs[0].page_content

'\n\n\nFirst Name\nLast Name\nCity\nGender\n\n\nBrandon\nJames\nMiami\nM\n\n\nSean\nHawkins\nDenver\nM\n\n\nJudy\nDay\nLos Angeles\nF\n\n\nAshley\nRuiz\nSan Francisco\nF\n\n\nStephanie\nGomez\nPortland\nF\n\n\n'

In [38]:
docs[0].metadata

{'source': 'data/sample.xlsx',
 'file_directory': 'data',
 'filename': 'sample.xlsx',
 'last_modified': '2024-11-03T21:48:01',
 'page_name': 'Data',
 'page_number': 1,
 'text_as_html': '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>First Name</td>\n      <td>Last Name</td>\n      <td>City</td>\n      <td>Gender</td>\n    </tr>\n    <tr>\n      <td>Brandon</td>\n      <td>James</td>\n      <td>Miami</td>\n      <td>M</td>\n    </tr>\n    <tr>\n      <td>Sean</td>\n      <td>Hawkins</td>\n      <td>Denver</td>\n      <td>M</td>\n    </tr>\n    <tr>\n      <td>Judy</td>\n      <td>Day</td>\n      <td>Los Angeles</td>\n      <td>F</td>\n    </tr>\n    <tr>\n      <td>Ashley</td>\n      <td>Ruiz</td>\n      <td>San Francisco</td>\n      <td>F</td>\n    </tr>\n    <tr>\n      <td>Stephanie</td>\n      <td>Gomez</td>\n      <td>Portland</td>\n      <td>F</td>\n    </tr>\n  </tbody>\n</table>',
 'languages': ['eng'],
 'filetype': 'application/vnd.openxmlformats-officedoc

In [39]:
context = docs[0].metadata['text_as_html']

In [40]:
print(context)

<table border="1" class="dataframe">
  <tbody>
    <tr>
      <td>First Name</td>
      <td>Last Name</td>
      <td>City</td>
      <td>Gender</td>
    </tr>
    <tr>
      <td>Brandon</td>
      <td>James</td>
      <td>Miami</td>
      <td>M</td>
    </tr>
    <tr>
      <td>Sean</td>
      <td>Hawkins</td>
      <td>Denver</td>
      <td>M</td>
    </tr>
    <tr>
      <td>Judy</td>
      <td>Day</td>
      <td>Los Angeles</td>
      <td>F</td>
    </tr>
    <tr>
      <td>Ashley</td>
      <td>Ruiz</td>
      <td>San Francisco</td>
      <td>F</td>
    </tr>
    <tr>
      <td>Stephanie</td>
      <td>Gomez</td>
      <td>Portland</td>
      <td>F</td>
    </tr>
  </tbody>
</table>
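Because `text_as_html` is regular HTML, it can also be turned into structured records without an LLM. The sketch below uses only the standard library's `html.parser`; `TableParser` and `html_table_to_records` are hypothetical names, and the sample is a shortened copy of the table above. It assumes the first row holds the column names, which matches this sheet but may not hold for every file.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_records(html: str) -> list:
    """Treat the first row as the header and return one dict per data row."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Shortened copy of the table above (two data rows) to keep the example small.
sample_html = """<table border="1" class="dataframe">
  <tbody>
    <tr><td>First Name</td><td>Last Name</td><td>City</td><td>Gender</td></tr>
    <tr><td>Brandon</td><td>James</td><td>Miami</td><td>M</td></tr>
    <tr><td>Judy</td><td>Day</td><td>Los Angeles</td><td>F</td></tr>
  </tbody>
</table>"""

print(html_table_to_records(sample_html))
```

In the notebook, `html_table_to_records(docs[0].metadata['text_as_html'])` would be the corresponding call.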


In [44]:
from IPython.display import HTML

In [45]:
HTML(context)

| First Name | Last Name | City | Gender |
| --- | --- | --- | --- |
| Brandon | James | Miami | M |
| Sean | Hawkins | Denver | M |
| Judy | Day | Los Angeles | F |
| Ashley | Ruiz | San Francisco | F |
| Stephanie | Gomez | Portland | F |


In [46]:
question = "Return the Data in Markdown format."

response = ask_llm(context=context,
                   question=question)

print(response)

| First Name | Last Name | City | Gender |
| --- | --- | --- | --- |
| Brandon | James | Miami | M |
| Sean | Hawkins | Denver | M |
| Judy | Day | Los Angeles | F |
| Ashley | Ruiz | San Francisco | F |
| Stephanie | Gomez | Portland | F |


In [52]:
question = "Return all entries in the table where Gender is 'F'. Format the response in Markdown. Do not write preambles and explanation."

response = ask_llm(context=context,
                   question=question)

print(response)

| First Name | Last Name | City    | Gender |
|:-----------|:----------|:---------|:-------|
| Judy       | Day        | Los Angeles| F       |
| Ashley     | Ruiz       | San Francisco| F       |
| Stephanie  | Gomez      | Portland   | F       |


In [51]:
question = "Return all entries in the table where Gender is 'male'. Format the response in Markdown. Do not write preambles and explanation."

response = ask_llm(context=context,
                   question=question)

print(response)

| First Name | Last Name | City        | Gender |
|:-----------|:----------|:------------|:-------|
| Brandon    | James     | Miami       | M       |
| Sean       | Hawkins   | Denver      | M       |
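The two prompts above depend on the model to copy rows faithfully and to map `'male'` to `M`. Given the note at the start of this project, a deterministic filter is a reasonable cross-check. The sketch below hard-codes the five rows from the sheet; `GENDER_ALIASES`, `filter_by_gender`, and `to_markdown` are hypothetical, and the alias mapping is an assumption about how free-text labels relate to the codes in the file.

```python
HEADER = ["First Name", "Last Name", "City", "Gender"]
ROWS = [
    ["Brandon", "James", "Miami", "M"],
    ["Sean", "Hawkins", "Denver", "M"],
    ["Judy", "Day", "Los Angeles", "F"],
    ["Ashley", "Ruiz", "San Francisco", "F"],
    ["Stephanie", "Gomez", "Portland", "F"],
]

# Assumed mapping from free-text labels to the codes used in the sheet.
GENDER_ALIASES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def filter_by_gender(rows, gender):
    """Keep the rows whose Gender column matches the (normalised) label."""
    code = GENDER_ALIASES.get(gender.strip().lower())
    if code is None:
        raise ValueError(f"Unknown gender label: {gender!r}")
    gender_index = HEADER.index("Gender")
    return [row for row in rows if row[gender_index] == code]


def to_markdown(header, rows):
    """Render a header and rows as a Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


print(to_markdown(HEADER, filter_by_gender(ROWS, "male")))
```

Comparing this output with the LLM's answer is a cheap way to spot dropped or altered rows.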


***

### Project 3: Personalized Job Application Letter

In [54]:
# !pip install -U docx2txt

In [55]:
from langchain_community.document_loaders import Docx2txtLoader

In [56]:
loader = Docx2txtLoader("data/job_description.docx")

docs = loader.load()

In [57]:
len(docs)

1

In [58]:
docs[0].metadata

{'source': 'data/job_description.docx'}

In [59]:
context = docs[0].page_content

In [60]:
print(context)

Job Description - Data Scientist

At SpiceJet, we rely on data to provide us valuable insights, and to automate our systems and solutions to help us increase revenues, reduce costs and provide improved customer experiences. We are seeking an experienced data scientist to deliver insights and automate our systems and processes. Ideal team member will have mathematical and statistical expertise, experience with modern data science programming languages and machine learning/AI platforms and techniques. You will mine, clean and interpret our data and then develop machine learning models to deliver business value across different parts of the business. 

Objectives of this Role

Use Data Science and Machine Learning to increase revenue, reduce costs and increase customer satisfaction.

Collaborate with product design and engineering to develop an understanding of needs

Understand where the required data resides and work on ways to extract the relevant data.

Research and devise statistical

In [63]:
question ="""
My name is Aaditya, and I am a recent graduate from IIT with a focus on Natural Language Processing and Machine Learning.
I am applying for a Data Scientist position at SpiceJet.
Please write a concise job application email for me (about 100-150 words), removing any placeholders, including references to job boards or sources.
"""

response = ask_llm(context,question)

print(response)

Subject: Application for Data Scientist Position at SpiceJet

Dear Hiring Manager,

I am excited to apply for the Data Scientist position at SpiceJet. As a recent graduate from IIT with a focus on Natural Language Processing and Machine Learning, I am confident in my ability to deliver insights and automate systems using data science and machine learning techniques.

With a solid foundation in mathematical and statistical expertise, experience with modern programming languages such as Python and R, and proficiency in machine learning platforms and techniques, I am well-equipped to drive business value across different parts of the organization. My strong understanding of NLP and ML concepts enables me to apply algorithms to make smarter and intelligent data-driven systems.

I am eager to collaborate with product design and engineering teams to develop an understanding of needs and devise statistical and machine learning models that drive revenue growth, reduce costs, and enhance custom
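
The prompt in this project hard-codes the applicant details and asks for about 100-150 words, but nothing checks the length of the reply. A minimal sketch of a reusable prompt builder and a word-count check is shown below; `build_application_question` and `within_word_range` are hypothetical helpers, and counting whitespace-separated tokens is only a rough approximation of a word count.

```python
def build_application_question(name, background, role, company,
                               min_words=100, max_words=150):
    """Fill the applicant details into the same prompt wording used above."""
    return (
        f"My name is {name}, and I am {background}.\n"
        f"I am applying for a {role} position at {company}.\n"
        f"Please write a concise job application email for me "
        f"(about {min_words}-{max_words} words), removing any placeholders, "
        f"including references to job boards or sources.\n"
    )


def within_word_range(text, min_words=100, max_words=150):
    """Count whitespace-separated words and compare against the range."""
    return min_words <= len(text.split()) <= max_words


question = build_application_question(
    name="Aaditya",
    background=("a recent graduate from IIT with a focus on "
                "Natural Language Processing and Machine Learning"),
    role="Data Scientist",
    company="SpiceJet",
)
print(question)
print(within_word_range("word " * 120))  # True
```

If the check fails on a real response, one option is to ask the model again with the measured word count included in the follow-up prompt.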