In [2]:
summary_text = '''
In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. In this blog post, we’ll explore if and how it helps improve efficiency and accuracy in LLM-related applications.

As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as possible that is still semantically relevant.

For example, in semantic search, we index a corpus of documents, with each document containing valuable information on a specific topic. By applying an effective chunking strategy, we can ensure our search results accurately capture the essence of the user’s query. If our chunks are too small or too large, it may lead to imprecise search results or missed opportunities to surface relevant content. As a rule of thumb, if the chunk of text makes sense without the surrounding context to a human, it will make sense to the language model as well. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensuring that the search results are accurate and relevant.

Another example is conversational agents (which we covered before using Python and Javascript). We use the embedded chunks to build the context for the conversational agent based on a knowledge base that grounds the agent in trusted information. In this situation, it’s important to make the right choice about our chunking strategy for two reasons: First, it will determine whether the context is actually relevant to our prompt. Second, it will determine whether or not we’ll be able to fit the retrieved text into the context before sending it to an outside model provider (e.g., OpenAI), given the limitations on the number of tokens we can send for each request. In some cases, like when using GPT-4 with a 32k context window, fitting the chunks might not be an issue. Still, we need to be mindful of when we’re using very big chunks, as this may adversely affect the relevancy of the results we get back from Pinecone.

In this post, we’ll explore several chunking methods and discuss the tradeoffs you should think about when choosing a chunking size and method. Finally, we’ll give some recommendations for determining the best chunk size and method that will be appropriate for your application.

Start using Pinecone for free
Pinecone is the developer-favorite vector database that's fast and easy to use at any scale.
Email address
Subscribe
Embedding short and long content
When we embed our content, we can anticipate distinct behaviors depending on whether the content is short (like sentences) or long (like paragraphs or entire documents).

When a sentence is embedded, the resulting vector focuses on the sentence’s specific meaning. The comparison would naturally be done on that level when compared to other sentence embeddings. This also implies that the embedding may miss out on broader contextual information found in a paragraph or document.

When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text. Larger input text sizes, on the other hand, may introduce noise or dilute the significance of individual sentences or phrases, making finding precise matches when querying the index more difficult.

The length of the query also influences how the embeddings relate to one another. A shorter query, such as a single sentence or phrase, will concentrate on specifics and may be better suited for matching against sentence-level embeddings. A longer query that spans more than one sentence or a paragraph may be more in tune with embeddings at the paragraph or document level because it is likely looking for broader context or themes.

The index may also be non-homogeneous and contain embeddings for chunks of varying sizes. This may pose challenges in terms of query result relevance, but it may also have some positive consequences. On the one hand, the relevance of the query result may fluctuate because of discrepancies between the semantic representations of long and short content. On the other, a non-homogeneous index could potentially capture a wider range of context and information since different chunk sizes represent different levels of granularity in the text. This could accommodate different types of queries more flexibly.

Chunking Considerations
Several variables play a role in determining the best chunking strategy, and these variables vary depending on the use case. Here are some key aspects to keep in mind:

What is the nature of the content being indexed? Are you working with long documents, such as articles or books, or shorter content, like tweets or instant messages? The answer would dictate both which model would be more suitable for your goal and, consequently, what chunking strategy to apply.
Which embedding model are you using, and what chunk sizes does it perform optimally on? For instance, sentence-transformer models work well on individual sentences, but a model like text-embedding-ada-002 performs better on chunks containing 256 or 512 tokens.
What are your expectations for the length and complexity of user queries? Will they be short and specific or long and complex? This may inform the way you choose to chunk your content as well so that there’s a closer correlation between the embedded query and embedded chunks.
How will the retrieved results be utilized within your specific application? For example, will they be used for semantic search, question answering, summarization, or other purposes? For example, if your results need to be fed into another LLM with a token limit, you’ll have to take that into consideration and limit the size of the chunks based on the number of chunks you’d like to fit into the request to the LLM.
Answering these questions will allow you to develop a chunking strategy that balances performance and accuracy, and this, in turn, will ensure the query results are more relevant.

'''
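
The point in the text above about fitting retrieved chunks into a token-limited request comes down to simple arithmetic. The following is a minimal sketch of that calculation; the function name `max_chunks_in_context` and every number in the example are hypothetical and not taken from any model provider.

```python
def max_chunks_in_context(context_window, prompt_tokens, answer_tokens, chunk_tokens):
    """Return how many retrieved chunks of `chunk_tokens` tokens fit in one request.

    All arguments are token counts. The budget left for retrieved context is the
    context window minus the tokens reserved for the prompt and for the answer.
    """
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    budget = context_window - prompt_tokens - answer_tokens
    return max(0, budget // chunk_tokens)


# Hypothetical numbers, for illustration only
print(max_chunks_in_context(context_window=4096, prompt_tokens=300,
                            answer_tokens=500, chunk_tokens=256))
```

Working backwards from this budget is one way to pick an upper bound on chunk size before tuning for retrieval quality.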

In [4]:
# Word Count non overlapping chunking

In [8]:
def chunk_by_word_count(text, max_word_count):
    words = text.split()
    chunks = [words[i:i + max_word_count] for i in range(0, len(words), max_word_count)]
    return [' '.join(chunk) for chunk in chunks]

# Example usage:
text = summary_text
max_word_count = 50
chunks = chunk_by_word_count(text, max_word_count)
for chunk in chunks:
    print(chunk)
    print("-" * 20)


In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. In this
--------------------
blog post, we’ll explore if and how it helps improve efficiency and accuracy in LLM-related applications. As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as
--------------------
possible that is still semantically relevant. For example, in semantic search, we index a corpus of documents, with each document containing valuable information on a specific topic. By applying an effective chunking strategy, we can ensure our search results accurately capture the essence of the user’s query. If our chunks
--------------------
are too small or too large, it

In [29]:
# Word Count overlapping/non-overlapping chunking

In [19]:
#!pip install nltk

import nltk
nltk.download('punkt')
import warnings
warnings.filterwarnings("ignore")

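def chunk_text(text, chunk_size, overlap):
    # The step must be positive, otherwise the loop below never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    # word_tokenize splits punctuation into separate tokens, so sizes are counted in tokens
    words = nltk.word_tokenize(text)
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break  # the tail is covered; avoid a trailing chunk made only of overlap
        start += chunk_size - overlap
    return chunks


# Example usage
text = summary_text
chunk_size = 200  # Number of tokens (words and punctuation marks) in each chunk
overlap = 50      # Number of tokens shared by consecutive chunks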
chunks = chunk_text(text, chunk_size, overlap)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {chunk}")

Chunk 1: In the context of building LLM-related applications , chunking is the process of breaking down large pieces of text into smaller segments . It ’ s an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content . In this blog post , we ’ ll explore if and how it helps improve efficiency and accuracy in LLM-related applications . As we know , any content that we index in Pinecone needs to be embedded first . The main reason for chunking is to ensure we ’ re embedding a piece of content with as little noise as possible that is still semantically relevant . For example , in semantic search , we index a corpus of documents , with each document containing valuable information on a specific topic . By applying an effective chunking strategy , we can ensure our search results accurately capture the essence of the user ’ s query . If our chunks are too small or too large , it may lead to imprecise search r

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
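
Word-count chunks can cut a sentence in half. One possible variation, sketched below, packs whole sentences into chunks instead. The function name `chunk_by_sentences` is hypothetical, and the sentence splitter is a naive regular expression (it assumes a sentence ends with `.`, `!` or `?` followed by whitespace), not a real sentence tokenizer.

```python
import re


def chunk_by_sentences(text, max_words):
    """Pack whole sentences into chunks of at most `max_words` words.

    A sentence longer than `max_words` becomes a chunk of its own
    rather than being cut in the middle.
    """
    # Naive split: break after '.', '!' or '?' when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, current_words = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Flush the running chunk if this sentence would push it over the limit
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = ("Chunking splits text. Small chunks are precise. "
          "Large chunks keep context. Pick a size that fits your queries.")
for chunk in chunk_by_sentences(sample, max_words=8):
    print(chunk)
    print("-" * 20)
```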


In [None]:
#Char count overlapping/nonoverlapping chunking

In [32]:
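def chunk_text_by_characters(text, chunk_size, overlap_size):
    # The step must be positive, otherwise the chunks never advance
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be non-negative and smaller than chunk_size")
    chunks = []
    step_size = chunk_size - overlap_size
    for i in range(0, len(text), step_size):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # the final (possibly shorter) chunk reached the end of the text
    return chunks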

# Example usage
text = summary_text
chunk_size = 200  # Number of characters in each chunk
overlap_size = 10  # Number of overlapping characters between chunks
chunks = chunk_text_by_characters(text, chunk_size, overlap_size)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {chunk}")

Chunk 1: 
In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance
Chunk 2:  relevance of the content we get back from a vector database once we use the LLM to embed content. In this blog post, we’ll explore if and how it helps improve efficiency and accuracy in LLM-related a
Chunk 3: -related applications.

As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’re embedding a piece of content with as little noise 
Chunk 4: tle noise as possible that is still semantically relevant.

For example, in semantic search, we index a corpus of documents, with each document containing valuable information on a specific topic. By 
Chunk 5: topic. By applying an effective chunking strategy, we can ensure our search results accurately capture the essence of the user’s query. If our chunks a

In [None]:
#Char count overlapping/nonoverlapping chunking using langchain

In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the parameters
chunk_size = 200  #number of characters
chunk_overlap = 50

# Instantiate the RecursiveCharacterTextSplitter class
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Define the text (Example text provided)
text = summary_text

# Create documents using the text splitter
docs = text_splitter.create_documents([text])

# Output the results
for i, doc in enumerate(docs):
    print(f"Document {i + 1}:")
    print(doc)


Document 1:
page_content='In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance'
Document 2:
page_content='technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. In this blog post, we’ll explore if and how it helps improve'
Document 3:
page_content='post, we’ll explore if and how it helps improve efficiency and accuracy in LLM-related applications.'
Document 4:
page_content='As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as possible that is'
Document 5:
page_content='content with as little noise as possible that is still semantically relevant.'
Document 6:
page_content='For example, in semantic search, we index a corpus of documents, with e

In [34]:
#Custom/Natural Delimiter chunking with text overlapping

In [40]:
def custom_delimiter_chunker_with_overlap(text, delimiter, max_chunk_size, overlap_size):
    # Split the text by the delimiter, dropping empty segments and stray whitespace
    segments = [s.strip() for s in text.split(delimiter) if s.strip()]

    chunks = []
    current_chunk = []

    for segment in segments:
        # Length of the current chunk once joined, plus the delimiter and the new segment
        current_len = len(delimiter.join(current_chunk))
        # Flush only a non-empty chunk; a single oversized segment is kept whole
        if current_chunk and current_len + len(delimiter) + len(segment) > max_chunk_size:
            # Join the current chunk into a single string and add it to the list of chunks
            chunk_str = delimiter.join(current_chunk)
            chunks.append(chunk_str)

            # Create the overlap portion (chunk_str[-0:] would return the whole string)
            overlap_portion = chunk_str[-overlap_size:] if overlap_size > 0 else ""

            # Start a new chunk with the overlap portion and the current segment
            current_chunk = [overlap_portion, segment] if overlap_portion else [segment]
        else:
            # Add the segment to the current chunk
            current_chunk.append(segment)

    # Add any remaining segments as the last chunk
    if current_chunk:
        chunks.append(delimiter.join(current_chunk))

    return chunks

# Example usage:
text = summary_text

delimiter = "\n\n"  # Double newlines as the delimiter
max_chunk_size = 200  # Maximum chunk size in characters
overlap_size = 30  # Overlap size in characters

chunks = custom_delimiter_chunker_with_overlap(text, delimiter, max_chunk_size, overlap_size)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 20)


Chunk 1:
In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. In this blog post, we’ll explore if and how it helps improve efficiency and accuracy in LLM-related applications.
--------------------
Chunk 2:
y in LLM-related applications.

As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as possible that is still semantically relevant.
--------------------
Chunk 3:
s still semantically relevant.

For example, in semantic search, we index a corpus of documents, with each document containing valuable information on a specific topic. By applying an effective chunking strategy, we can ensure our search 

In [35]:
# Install langchain library
#!pip install langchain

from langchain.text_splitter import CharacterTextSplitter

# Define the parameters
delimiter = "\n\n"  # Separator to use for splitting
chunk_size = 200    # Maximum chunk size in characters
chunk_overlap = 50  # Overlap between chunks in characters

# Instantiate the CharacterTextSplitter class
text_splitter = CharacterTextSplitter(separator=delimiter, chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Define the text (Example text provided)
text = summary_text

# Create documents using the text splitter
docs = text_splitter.create_documents([text])

# Output the results
for i, doc in enumerate(docs):
    print(f"Document {i + 1}:")
    print(doc)



Document 1:
page_content='In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. In this blog post, we’ll explore if and how it helps improve efficiency and accuracy in LLM-related applications.'
Document 2:
page_content='As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as possible that is still semantically relevant.'
Document 3:
page_content='For example, in semantic search, we index a corpus of documents, with each document containing valuable information on a specific topic. By applying an effective chunking strategy, we can ensure our search results accurately capture the essence of the user’s query. If our chunks are too small

In [None]:
#Custom/Natural Delimiter Chunker with Word Overlapping

In [41]:
def custom_delimiter_chunker_with_word_overlap(text, delimiter, max_chunk_size, overlap_size):
    # Split the text by the delimiter and drop empty segments
    segments = [s for s in text.split(delimiter) if s.strip()]

    # To hold the final chunks
    chunks = []
    current_chunk = []      # list of segments, each stored as a list of words
    current_chunk_size = 0  # number of words currently in current_chunk

    for segment in segments:
        segment_words = segment.split()
        segment_size = len(segment_words)

        # Flush only a non-empty chunk; a single oversized segment is kept whole
        if current_chunk and current_chunk_size + segment_size > max_chunk_size:
            # Join the current chunk into a single string and add it to the list of chunks
            chunks.append(delimiter.join(' '.join(words) for words in current_chunk))

            # Create the overlap portion: the last `overlap_size` words of the chunk just emitted
            all_words = [word for words in current_chunk for word in words]
            overlap_portion = all_words[-overlap_size:] if overlap_size > 0 else []

            # Start a new chunk with the overlap portion and the current segment
            current_chunk = [overlap_portion, segment_words] if overlap_portion else [segment_words]
            current_chunk_size = len(overlap_portion) + segment_size
        else:
            # Add the segment to the current chunk
            current_chunk.append(segment_words)
            current_chunk_size += segment_size

    # Add any remaining segments as the last chunk
    if current_chunk:
        chunks.append(delimiter.join(' '.join(words) for words in current_chunk))

    return chunks

# Example usage:
text = summary_text
delimiter = "\n\n"  # Double newlines as the delimiter
max_chunk_size = 40  # Maximum chunk size in words
overlap_size = 10    # Overlap size in words

chunks = custom_delimiter_chunker_with_word_overlap(text, delimiter, max_chunk_size, overlap_size)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 20)


Chunk 1:
In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. In this blog post, we’ll explore if and how it helps improve efficiency and accuracy in LLM-related applications.
--------------------
Chunk 2:
how it helps improve efficiency and accuracy in LLM-related applications.

As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’

In [None]:
# Fixed-size token chunking with the GPT-2 tokenizer

In [42]:
import torch  # required by return_tensors="pt"
from transformers import GPT2Tokenizer

def semantic_section_chunking(text, max_chunk_size=200, overlap_size=50):
    # Despite its name, this function cuts fixed-size token windows, not semantic sections.
    # Only the GPT-2 tokenizer is needed; the model weights are never used for splitting.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    # Tokenize the text into a tensor of shape (1, num_tokens)
    input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=False)

    # Define the chunking parameters (all sizes are in tokens).
    # NOTE: each window holds max_chunk_size - overlap_size tokens and the stride equals
    # the window length, so consecutive chunks do not actually overlap.
    chunk_size = max_chunk_size - overlap_size
    stride = chunk_size

    # Perform chunking
    chunks = []
    for i in range(0, input_ids.size(1), stride):
        # Slice the input_ids to form a chunk
        chunk_input_ids = input_ids[:, i:i+chunk_size]

        # Decode the chunk back to text
        chunk_text = tokenizer.decode(chunk_input_ids[0], skip_special_tokens=True)

        # Add the chunk to the list
        chunks.append(chunk_text)

    return chunks

# Example usage:
text = summary_text

max_chunk_size = 200  # Maximum chunk size in tokens
overlap_size = 50     # Tokens subtracted from the window size (see the note above)

chunks = semantic_section_chunking(text, max_chunk_size=max_chunk_size, overlap_size=overlap_size)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 20)



Token indices sequence length is longer than the specified maximum sequence length for this model (1300 > 1024). Running this sequence through the model will result in indexing errors


Chunk 1:

In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a vector database once we use the LLM to embed content. In this blog post, we’ll explore if and how it helps improve efficiency and accuracy in LLM-related applications.

As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as possible that is still semantically relevant.

For example, in semantic search, we index a
--------------------
Chunk 2:
 corpus of documents, with each document containing valuable information on a specific topic. By applying an effective chunking strategy, we can ensure our search results accurately capture the essence of the user’s query. If our chunks are too small or too large, it may lead to impreci
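
In the output above, Chunk 2 starts exactly where Chunk 1 ends, so no tokens are shared between them. A minimal sketch of a token window that does overlap is shown below. It works on any list of token ids, so it is independent of the tokenizer; the function name `sliding_token_windows` is hypothetical. With a Hugging Face tokenizer, the ids could come from `tokenizer.encode(text)` and each window could be turned back into text with `tokenizer.decode(window)`.

```python
def sliding_token_windows(token_ids, window_size, overlap_size):
    """Return windows of `window_size` tokens where consecutive windows
    share `overlap_size` tokens. The last window may be shorter."""
    if not 0 <= overlap_size < window_size:
        raise ValueError("overlap_size must be non-negative and smaller than window_size")

    stride = window_size - overlap_size  # how far each window advances
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return windows


# Integers stand in for token ids
ids = list(range(10))
print(sliding_token_windows(ids, window_size=4, overlap_size=1))
```

The difference from the cell above is that the window keeps its full size while only the stride is reduced by the overlap.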

In [None]:
#semantic section chunking

#https://medium.com/the-ai-forum/semantic-chunking-for-rag-f4733025d5f5#:~:text=Semantic%20chunking%20involves%20taking%20the,the%20most%20similar%20embeddings%20together.

In [1]:
#!pip install -qU langchain_experimental langchain_openai langchain_community langchain ragas chromadb langchain-groq fastembed pypdf openai

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/drive/MyDrive/doc-summary/dataset'/SUBBARAOGOGULAMUDI 4Y_6M.pdf")
documents = loader.load()

# The loader returns one Document per PDF page
print(len(documents))

5


In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False
)

naive_chunks = text_splitter.split_documents(documents)
for chunk in naive_chunks[10:15]:
    print(chunk.page_content + "\n")

✔ Excellent understanding Data Preprocessing steps, Getting the Dataset, Importing Libraries, 
Importing Datasets, Finding Missing Data, Encoding Categorical Data, LableEncoder,

OneHotEncoder, Splitting Dataset into Training and Test Set, Feature Scaling, Standardization & 
Normalization.

✔ Skilled in Classification Algorithms with Linear Models: Logistic Regres sion, Support Vector 
Machines, Non -linear Models: K -Nearest Neighbors, Naïve Bayes, Decision Tree Classification,

Random Forest Classification, Kernel SVM.  
 
PROFESSIONAL EXPERIENCE:   
 
● Working as a Data scientist in Curl Technology , Banglore  from 06th June to till date.

● Worked as Data scientist in CODESETS IT Solutions, Hyderabad,  from 3rd Mar 2018 to to 
03rd June 2022.  
 
Roles & Responsibilities:  
Project Name:  OCR Evolution  
Client   : Natixis



In [6]:
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")


In [8]:
from langchain_experimental.text_splitter import SemanticChunker

# Split where the embedding distance between consecutive sentences exceeds a percentile threshold
semantic_chunker = SemanticChunker(embed_model, breakpoint_threshold_type="percentile")

semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])


In [13]:
for semantic_chunk in semantic_chunks:
    print(semantic_chunk.page_content)
    #print(len(semantic_chunk.page_content))
    print("_" * 100)

SUBBARAO GOGULAMUDI  
DATA SCIENTIST  
Mobile: + 91 -8522881740                                                  H No:15 -520, GOLLAPALEM VILLAGE, INKOLLU  
Email : SUBBARAOGOGULAMUDIDS@GMAIL.COM            MANDAL, PRAKASAM DISTRICT,  A P - 523167  
OBJECTIVE : 
To be a continuous value addition to the organization, to work in an innovative and competitive world, 
intend to build a career with leading corporate with committed and dedicated people, which will help me 
to explore myself and realize my potential to the  fullest. PROFESSIONAL EXPERIENCE:   
 
✔ Professional qualified Data Scientist/Data Analyst with around 4.6 years of experience in Data 
Science and Analytics, including Data Mining, Machine Learning, Statistical Analysis and SQL.
____________________________________________________________________________________________________
✔ Involved in the entire data science life cycle and actively engaged in all the phases, including data 
cleaning, data extraction, and data visu
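
To make the percentile breakpoint idea more concrete, here is a small pure-Python sketch: compute the cosine distance between consecutive sentence embeddings and start a new chunk wherever a distance exceeds a chosen percentile of all distances. This illustrates the general idea only and is not the `SemanticChunker` implementation. The function names, the 2-D vectors standing in for embeddings, and the percentile value are all hypothetical, and the vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, q):
    """Linear-interpolation percentile of `values` for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_groups(sentences, embeddings, q=95):
    """Group consecutive sentences, starting a new group wherever the distance
    to the previous sentence exceeds the q-th percentile of all distances."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    threshold = percentile(distances, q)

    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            groups.append(current)  # large jump in meaning: close the group
            current = []
        current.append(sentence)
    groups.append(current)
    return groups


# Toy 2-D vectors: the first three point one way, the last two another
sentences = ["s1", "s2", "s3", "s4", "s5"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_groups(sentences, embeddings, q=75))
```

A lower percentile produces more, smaller groups; a higher one splits only at the sharpest topic changes.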