## Q. What are the most common prompts that developers ask ChatGPT? How does ChatGPT answer those prompts - Code or No Code?


In [2]:
# Importing all necessary packages
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import string
import networkx as nx
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Ensure necessary components are downloaded
# nltk.download('punkt')
# nltk.download('stopwords')

In [3]:
# Flatten the nested records under "Sources" into a flat DataFrame
data = pd.json_normalize(pd.read_json("20231012_235128_issue_sharings.json")["Sources"])
data.columns

Index(['Type', 'URL', 'Author', 'RepoName', 'RepoLanguage', 'Number', 'Title',
       'Body', 'CreatedAt', 'ClosedAt', 'UpdatedAt', 'State',
       'ChatgptSharing'],
      dtype='object')

**Considering the Issue Sharings file to test the code.**

In [4]:
data = pd.read_json("20231012_235128_issue_sharings.json")
data = pd.json_normalize(data['Sources'])
chatgpt_sharing_data = pd.json_normalize(data['ChatgptSharing'].explode())
chatgpt_sharing_data.columns

Index(['URL', 'Status', 'DateOfConversation', 'DateOfAccess', 'Title',
       'NumberOfPrompts', 'TokensOfPrompts', 'TokensOfAnswers', 'Model',
       'Conversations', 'HTMLContent', 'Mention.MentionedURL',
       'Mention.MentionedProperty', 'Mention.MentionedAuthor',
       'Mention.MentionedText'],
      dtype='object')

In [5]:
repo_lang = list(set(data["RepoLanguage"]))
print(*repo_lang)

Zig Rust PowerShell CSS R F# Swift Go Python Emacs Lisp MATLAB Verilog C# PLpgSQL Astro Clojure Kotlin Vue JavaScript Makefile None Java PHP Objective-C Shell C Groovy Nim Perl C++ Game Maker Language TeX TypeScript GLSL Ruby HTML Jupyter Notebook


In [6]:
data = pd.read_json("20231012_235128_issue_sharings.json")
data = pd.json_normalize(data['Sources'])
chatgpt_sharing_data = pd.json_normalize(data['ChatgptSharing'].explode())
# Expand each issue's list of ChatGPT sharings into one entry per sharing
expanded_chatgpt_sharing = data['ChatgptSharing'].apply(pd.Series).stack().reset_index(drop=True)
all_attributes_chatgpt_sharing = pd.json_normalize(expanded_chatgpt_sharing)
conversations_data = pd.json_normalize(all_attributes_chatgpt_sharing['Conversations'].explode())
conversations_data

Unnamed: 0,Prompt,Answer,ListOfCode
0,Raspberry Piをアクセスポイント化させて、Wi-Fi通信をスマホとさせました。Ra...,ChatGPTRaspberry Piをアクセスポイントとして設定し、スマホとのWi-Fi通...,[]
1,Raspberry Piには地図タイルが入っており、これをWi−Fi通信でスマホに配信します...,ChatGPT地図タイルを配信するためのRaspberry Piとスマホ間の接続テストを行う...,[]
2,上記で教えてくれたコマンドはsshでラズパイの中に入って打つっていう認識でよいよね？,ChatGPTはい、その通りです。提供されたコマンドのほとんどは、SSH (Secure S...,[]
3,iperfってどんなコマンドですか？中学生にもわかるように説明して,ChatGPT了解です！iperfは、コンピュータ間のネットワーク速度を計測するためのツール...,[]
4,どういった出力結果がでる？,ChatGPTiperfを使用すると、主に以下のような情報を得ることができます：帯域幅（Ba...,[]
...,...,...,...
2065,continue,ChatGPTUsed WebPilotThe version of the FSRS tu...,[]
2066,proceed,ChatGPTUsed WebPilotThe version of the FSRS tu...,[]
2067,I wish that in typescript I could mark a funct...,ChatGPTAs of my knowledge cutoff in September ...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '..."
2068,Are there no other options that use static ana...,ChatGPTAs of my knowledge cut-off in September...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '..."
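
The nesting that the cell above unpacks (sources, then sharings, then conversation turns) can be made explicit with plain loops. This is only a sketch: the function name `flatten_conversations` and the toy records are invented for illustration, and it assumes each source carries a `ChatgptSharing` list whose entries may carry a `Conversations` list, as the column listings above suggest.

```python
def flatten_conversations(sources):
    """Return one flat dict per prompt/answer turn found under the given sources."""
    rows = []
    for source in sources:
        # A source may have no sharings; treat a missing or None value as empty
        for sharing in source.get("ChatgptSharing") or []:
            for turn in sharing.get("Conversations") or []:
                rows.append({
                    "Prompt": turn.get("Prompt"),
                    "Answer": turn.get("Answer"),
                    "ListOfCode": turn.get("ListOfCode", []),
                })
    return rows


# Toy records (hypothetical) that mimic the shape of the dataset
toy_sources = [
    {"Title": "issue A", "ChatgptSharing": [
        {"URL": "u1", "Conversations": [
            {"Prompt": "p1", "Answer": "a1", "ListOfCode": []},
            {"Prompt": "p2", "Answer": "a2",
             "ListOfCode": [{"Type": "python", "Content": "print(1)\n"}]},
        ]},
    ]},
    {"Title": "issue B", "ChatgptSharing": [
        {"URL": "u2", "Conversations": [
            {"Prompt": "p3", "Answer": "a3", "ListOfCode": []},
        ]},
    ]},
]

rows = flatten_conversations(toy_sources)
print(len(rows))
```

The pandas version in the notebook does the same job with `explode` and `json_normalize`; the loop form mainly helps when checking which level a missing value comes from.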


## I asked ChatGPT what developers normally seek its assistance with.

Developers often approach ChatGPT with a wide range of requests, typically revolving around programming, debugging, learning concepts, and development tools. To identify these questions, you can focus on keywords that are common in such contexts. Here are some common types of requests and corresponding keywords:

1. **Coding Assistance**: 
   - **Keywords**: "code", "write", "script", "implement", "function", "algorithm".
   - **Example**: "Can you write a Python function to sort a list?"

2. **Debugging and Error Resolution**: 
   - **Keywords**: "debug", "error", "fix", "issue", "resolve", "exception", "traceback".
   - **Example**: "Help me debug this JavaScript error message I'm getting."

3. **Explanation of Concepts**:
   - **Keywords**: "explain", "understand", "concept", "how does", "what is", "theory", "principle".
   - **Example**: "Can you explain how recursion works in programming?"

4. **Learning and Educational Resources**:
   - **Keywords**: "learn", "tutorial", "course", "resource", "guide", "documentation", "example".
   - **Example**: "Recommend some resources to learn React."

5. **Best Practices and Design Patterns**:
   - **Keywords**: "best practice", "design pattern", "architecture", "efficient", "optimization", "structure", "model".
   - **Example**: "What are some best practices for REST API design?"

6. **Technology and Tool Recommendations**:
   - **Keywords**: "recommend", "tool", "library", "framework", "technology", "platform", "software".
   - **Example**: "Which libraries do you recommend for data visualization in Python?"

7. **Code Review and Optimization**:
   - **Keywords**: "review", "optimize", "improve", "enhance", "refactor", "performance", "efficiency".
   - **Example**: "Can you review and suggest improvements for my Ruby script?"

8. **Project Guidance and Planning**:
   - **Keywords**: "plan", "approach", "strategy", "project", "structure", "organize", "develop".
   - **Example**: "How should I plan the architecture for my mobile app project?"

9. **Integration and Configuration**:
   - **Keywords**: "integrate", "configure", "setup", "install", "deploy", "connect", "environment".
   - **Example**: "How do I integrate a payment gateway in my web application?"

10. **Version Control and Collaboration**:
    - **Keywords**: "git", "version control", "collaborate", "merge", "branch", "commit", "repository".
    - **Example**: "How do I resolve merge conflicts in Git?"

These keywords can help in identifying the nature of developers' questions. By focusing on these, you can categorize and address the specific needs of developer-related queries effectively.
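
One way to use these categories directly is to keep the keywords grouped by category instead of flattening them into one list. The sketch below is an assumption about how that could look, not part of the original analysis: `CATEGORY_KEYWORDS` and `categorize_prompt` are hypothetical names, and only four of the ten categories are copied here to keep it short.

```python
import re

# Subset of the categories listed above (hypothetical structure)
CATEGORY_KEYWORDS = {
    "Coding Assistance": ["code", "write", "script", "implement", "function", "algorithm"],
    "Debugging and Error Resolution": ["debug", "error", "fix", "issue", "resolve",
                                       "exception", "traceback"],
    "Explanation of Concepts": ["explain", "understand", "concept", "how does",
                                "what is", "theory", "principle"],
    "Version Control and Collaboration": ["git", "version control", "collaborate",
                                          "merge", "branch", "commit", "repository"],
}


def categorize_prompt(prompt):
    """Map each matching category to the keywords found as whole words in the prompt."""
    text = prompt.lower()
    matches = {}
    for category, category_keywords in CATEGORY_KEYWORDS.items():
        found = [kw for kw in category_keywords
                 if re.search(r"\b" + re.escape(kw) + r"\b", text)]
        if found:
            matches[category] = found
    return matches


print(categorize_prompt("Can you write a Python function to sort a list?"))
print(categorize_prompt("How do I resolve merge conflicts in Git?"))
```

Keeping the category attached to each hit makes it possible to count prompts per category later, which a flat keyword list cannot do.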

In [7]:
keywords = ["code", "write", "script", "implement", "function", "algorithm", "debug", "error", 
            "fix", "issue", "resolve", "exception", "traceback", "explain", "understand", "concept", 
            "how does", "what is", "theory", "principle", "learn", "tutorial", "course", "resource", 
            "guide", "documentation", "example", "best practice", "design pattern", "architecture", 
            "efficient", "optimization", "structure", "model", "recommend", "tool", "library", 
            "framework", "technology", "platform", "software", "review", "optimize", "improve", 
            "enhance", "refactor", "performance", "efficiency", "plan", "approach", "strategy", 
            "project", "structure", "organize", "develop",  "integrate", "configure", "setup", 
            "install", "deploy", "connect", "environment", "git", "version control", "collaborate", 
            "merge", "branch", "commit", "repository", "generate", 'GLSL','PHP','PLpgSQL', 'JavaScript', 
            'Clojure', 'F#','Swift','C++','MATLAB', 'Objective-C', 'Rust', 'HTML', 'Go', 'Python', 'Verilog', 'C#', 'Groovy', 
            'Jupyter Notebook', 'PowerShell', 'C', 'Shell', 'TypeScript', 'Nim', 'Makefile', 'Perl', 'Vue', 'CSS', 'Astro', 
            'TeX', 'R', "Zig", "Ruby", "Java", "Emacs Lisp", "Kotlin", "Game Maker Language", "Python3", "Python2"]

In [8]:
conversations_data["Prompt"][12]

'Write me python3 script that takes in mp3 audio track and generate a very beautiful audio visualizer video'

In [9]:
def read_and_preprocess_text(text):
    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text)

    # Tokenize each sentence into words, remove stopwords and punctuation, and convert to lowercase
    stop_words = set(nltk.corpus.stopwords.words('english') + list(string.punctuation))
    tokenized_sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence) if word.isalnum() and word.lower() not in stop_words]
        for sentence in sentences
    ]

    return tokenized_sentences

def calculate_cosine_similarity(tfidf_matrix):
    # Calculate cosine similarity between sentences based on TF-IDF matrix
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return similarity_matrix

def generate_summary(text, num_sentences=3):
    # Preprocess the text
    tokenized_sentences = read_and_preprocess_text(text)

    # Convert sentences to TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([' '.join(sentence) for sentence in tokenized_sentences])

    # Calculate cosine similarity matrix
    similarity_matrix = calculate_cosine_similarity(tfidf_matrix)

    # Use the PageRank algorithm to rank sentences
    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)

    # Sort sentences by their scores and select the top ones
    ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(tokenized_sentences)), reverse=True)
    summary_sentences = [sentence for score, sentence in ranked_sentences[:num_sentences]]

    # Join the summary sentences to form the final summary
    summary = ' '.join([' '.join(sentence) for sentence in summary_sentences])

    return summary
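
To see what the ranking step does without any third-party packages, here is a simplified stand-in. It is not the notebook's method: it uses raw term counts instead of TF-IDF, ignores each sentence's similarity to itself, and replaces `nx.pagerank` with a hand-written power iteration. The names `cosine` and `rank_sentences` and the sample sentences are made up for the illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(tokenized_sentences, damping=0.85, iterations=50):
    """Score sentences with a simplified PageRank over their similarity graph."""
    n = len(tokenized_sentences)
    if n == 0:
        return []
    vectors = [Counter(tokens) for tokens in tokenized_sentences]
    # Edge weights: similarity between different sentences (self-similarity ignored)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives a share of the score of every sentence similar to it
        scores = [
            (1 - damping) / n
            + damping * sum(sim[j][i] * scores[j] / out_weight[j]
                            for j in range(n) if out_weight[j])
            for i in range(n)
        ]
    return scores


sentences = [
    ["install", "python", "package"],
    ["python", "package", "error"],
    ["weather", "sunny"],
]
scores = rank_sentences(sentences)
print(scores)
```

The two sentences that share vocabulary reinforce each other and end up with equal, higher scores, while the unrelated sentence keeps only the baseline score. For a single-sentence prompt there is nothing to rank, so the summary is simply that sentence's filtered tokens.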

In [82]:
text = conversations_data["Prompt"][12]

summary = generate_summary(text, num_sentences=4).split()

In [83]:
print(summary)

['write', 'python3', 'script', 'takes', 'mp3', 'audio', 'track', 'generate', 'beautiful', 'audio', 'visualizer', 'video']


In [11]:
conversations_data["ListOfCode"][12]

[{'ReplaceString': '[CODE_BLOCK_0]',
  'Type': 'bash',
  'Content': 'pip install moviepy numpy scipy matplotlib\n'},
 {'ReplaceString': '[CODE_BLOCK_1]',
  'Type': 'python',
  'Content': 'import moviepy.editor as mp\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom moviepy.video.fx import fadein\n\n# Define your input MP3 audio file and output video file\naudio_file = "input_audio.mp3"\noutput_file = "output_visualizer.mp4"\n\n# Load the audio file\naudio = mp.AudioFileClip(audio_file)\n\n# Set the duration of the video to match the audio\nvideo = mp.VideoClip(lambda t: [0], duration=audio.duration)\n\n# Define a function to generate a frame for each moment in time\ndef make_frame(t):\n    # Get the audio data at the given time\n    audio_data = audio.to_soundarray()[int(t * audio.fps)]\n    \n    # Create a simple audio visualizer using matplotlib\n    plt.figure(figsize=(10, 4))\n    plt.specgram(audio_data, Fs=audio.fps, cmap="viridis")\n    plt.axis(\'off\')\n    \n    # S

In [84]:
text = conversations_data["Prompt"][12]
summary = generate_summary(text, num_sentences=4).split()
[i for i in keywords if i in summary]

['write', 'script', 'generate']
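
Note that `python3` is in the summary but is not reported: the summary tokens are lowercased while the keyword list contains `"Python3"`. Multi-word keywords such as `"how does"` also cannot equal a single token. A minimal sketch of a case-insensitive comparison follows; `match_keywords` and the shortened keyword list are hypothetical.

```python
def match_keywords(tokens, keywords):
    """Return the tokens that equal a keyword, ignoring case and repeated tokens."""
    lowered = {kw.lower() for kw in keywords}
    # dict.fromkeys removes repeated tokens while keeping their first-seen order
    return [token for token in dict.fromkeys(tokens) if token in lowered]


summary_tokens = ['write', 'python3', 'script', 'takes', 'mp3', 'audio', 'track',
                  'generate', 'beautiful', 'audio', 'visualizer', 'video']
sample_keywords = ["code", "write", "script", "generate", "Python", "Python3"]

print(match_keywords(summary_tokens, sample_keywords))
```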


### Writing a function combining everything together

In [15]:
def read_and_preprocess_text(text):
    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text)

    # Tokenize each sentence into words, remove stopwords and punctuation, and convert to lowercase
    stop_words = set(nltk.corpus.stopwords.words('english') + list(string.punctuation))
    tokenized_sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence) if word.isalnum() and word.lower() not in stop_words]
        for sentence in sentences
    ]

    return tokenized_sentences

def calculate_cosine_similarity(tfidf_matrix):
    # Calculate cosine similarity between sentences based on TF-IDF matrix
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return similarity_matrix

def generate_summary(text, num_sentences=3):
    # Preprocess the text
    tokenized_sentences = read_and_preprocess_text(text)

    # Convert sentences to TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([' '.join(sentence) for sentence in tokenized_sentences])

    # Calculate cosine similarity matrix
    similarity_matrix = calculate_cosine_similarity(tfidf_matrix)

    # Use the PageRank algorithm to rank sentences
    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)

    # Sort sentences by their scores and select the top ones
    ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(tokenized_sentences)), reverse=True)
    summary_sentences = [sentence for score, sentence in ranked_sentences[:num_sentences]]

    # Join the summary sentences to form the final summary
    summary = ' '.join([' '.join(sentence) for sentence in summary_sentences])

    return summary

# Example usage
text = conversations_data["Prompt"][12]

summary = generate_summary(text, num_sentences=4).split()

keywords = ["code", "write", "script", "implement", "function", "algorithm", "debug", "error", 
            "fix", "issue", "resolve", "exception", "traceback", "explain", "understand", "concept", 
            "how does", "what is", "theory", "principle", "learn", "tutorial", "course", "resource", 
            "guide", "documentation", "example", "best practice", "design pattern", "architecture", 
            "efficient", "optimization", "structure", "model", "recommend", "tool", "library", 
            "framework", "technology", "platform", "software", "review", "optimize", "improve", 
            "enhance", "refactor", "performance", "efficiency", "plan", "approach", "strategy", 
            "project", "structure", "organize", "develop",  "integrate", "configure", "setup", 
            "install", "deploy", "connect", "environment", "git", "version control", "collaborate", 
            "merge", "branch", "commit", "repository", "generate", 'GLSL','PHP','PLpgSQL', 'JavaScript', 
            'Clojure', 'F#','Swift','C++','MATLAB', 'Objective-C', 'Rust', 'HTML', 'Go', 'Python', 'Verilog', 'C#', 'Groovy', 
            'Jupyter Notebook', 'PowerShell', 'C', 'Shell', 'TypeScript', 'Nim', 'Makefile', 'Perl', 'Vue', 'CSS', 'Astro', 
            'TeX', 'R', "Zig", "Ruby", "Java", "Emacs Lisp", "Kotlin", "Game Maker Language", "Python3", "Python2"]

keywords_in_summary = []
for i in keywords:
    if i in summary:
        keywords_in_summary.append(i)
print("Keywords found in summary are: ", keywords_in_summary)
print("\n")
print("Did ChatGPT provide code or no-code solution:")
if keywords_in_summary != []:
    if conversations_data["ListOfCode"][12] != []:
        print("ChatGPT provided code for the given prompt.")
    else:
        print("ChatGPT provided no-code solution for the given prompt.")

Keywords found in summary are:  ['write', 'script', 'generate']


Did ChatGPT provide code or no-code solution:
ChatGPT provided code for the given prompt.
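
The decision printed above is tied to row 12. A small helper makes the same decision reusable for any row; this is a sketch under the assumption that a row counts as "code" when its `ListOfCode` is non-empty, and the name `classify_response` and the third label are hypothetical.

```python
def classify_response(found_keywords, list_of_code):
    """Label a prompt/answer pair from its keyword hits and extracted code blocks."""
    if not found_keywords:
        # The cell above prints nothing in this case; an explicit label is used here
        return "no developer keywords"
    return "code" if list_of_code else "no-code"


example_blocks = [{"ReplaceString": "[CODE_BLOCK_0]", "Type": "bash",
                   "Content": "pip install moviepy numpy scipy matplotlib\n"}]

print(classify_response(["write", "script", "generate"], example_blocks))
print(classify_response(["example"], []))
print(classify_response([], []))
```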


### Considering only English values because summarizing non-English values is difficult

In [74]:
def is_english(text):
    """
    Return True if the given value is a string made up only of ASCII characters.

    This is a rough proxy for English: it rejects non-Latin scripts, but also
    any English prompt that contains accented letters, emoji, or curly quotes.
    """
    if isinstance(text, str):
        try:
            # Non-ASCII characters raise UnicodeDecodeError when decoded as ASCII
            text.encode(encoding='utf-8').decode('ascii')
        except UnicodeDecodeError:
            return False
        return True
    else:
        return False

def filter_english_prompts(df):
    """
    Keep only the rows whose prompt passes is_english and reset the index.
    """
    # Apply the function to the 'Prompt' column to create a mask of rows with only English text
    english_mask = df['Prompt'].apply(is_english)

    # Extract the rows where the mask is True
    english_data = df[english_mask]

    # Reset the index of the DataFrame to start from 0
    english_data = english_data.reset_index(drop=True)

    return english_data

# Call the function with the DataFrame
english_df = filter_english_prompts(conversations_data)
english_df

Unnamed: 0,Prompt,Answer,ListOfCode
0,You are an expert search query generator.\n\nI...,"ChatGPTCertainly, here are 10 high-quality sea...","[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '..."
1,https://github.com/arakoodev/EdgeChains,ChatGPTFinished browsingShow workHide workEdge...,[]
2,can this be used from python:\nhttps://github....,"ChatGPTYes, EdgeChains can be used from Python...",[]
3,give me an example,ChatGPTI do not have an example. You may find ...,[]
4,Write me python3 script that takes in mp3 audi...,ChatGPTCreating a beautiful audio visualizer v...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '..."
...,...,...,...
1486,continue,ChatGPTUsed WebPilotThe version of the FSRS tu...,[]
1487,proceed,ChatGPTUsed WebPilotThe version of the FSRS tu...,[]
1488,I wish that in typescript I could mark a funct...,ChatGPTAs of my knowledge cutoff in September ...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '..."
1489,Are there no other options that use static ana...,ChatGPTAs of my knowledge cut-off in September...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '..."
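
The filter keeps 1,490 of the 2,069 rows. Because it tests for ASCII rather than for language, it is worth seeing which inputs pass. The sketch below uses `str.isascii()` (available since Python 3.7) as an equivalent check for strings; `is_ascii_text` is a hypothetical name and the sample strings are illustrative.

```python
def is_ascii_text(value):
    """True only for str values whose characters are all ASCII."""
    return isinstance(value, str) and value.isascii()


samples = {
    "give me an example": True,           # plain English passes
    "どういった出力結果がでる？": False,      # Japanese is rejected
    "naïve approach to caching": False,   # English with an accented letter is rejected too
    "hola, como estas": True,             # non-English text without accents still passes
}

for text, expected in samples.items():
    print(repr(text), is_ascii_text(text), expected)
```

A language-detection library would be needed for a stricter filter; that is outside what this notebook does.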


### Generating summaries for all the prompts in the English DF

In [66]:
# Collect every prompt from the English-only DataFrame
prompts = english_df["Prompt"].tolist()
for prompt in prompts:
    try:
        print(generate_summary(prompt))
    except ValueError:
        # Raised when a prompt has no usable tokens left after preprocessing
        print("no data")

In [79]:
summaries = []

# Generate summaries for each prompt
for prompt in prompts:
    try:
        summary = generate_summary(prompt)
    except ValueError:
        # If the error occurs, append a placeholder text
        summary = "no data"
    summaries.append(summary)

# Create a DataFrame from the summaries
summaries_df = pd.DataFrame(summaries, columns=['Summary'])

final_df = english_df.join(summaries_df)

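### Checking how many values in summary are "no data"

`generate_summary` raises a `ValueError` when a prompt leaves `TfidfVectorizer` with nothing to fit, for example a prompt made up only of stop words, punctuation, or symbols. The loop above catches that error and stores the placeholder "no data", so counting the placeholders shows how many prompts could not be summarized.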

In [80]:
count = 0
for i in summaries_df["Summary"]:
    if i == "no data":
        count+=1
print(count) 

13
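
Counting the placeholders says how many prompts failed but not which ones. A possible refinement, sketched here with hypothetical names (`summarize_all`, `toy_summarizer`), is to record the failing positions as well so those prompts can be inspected. The toy summarizer only imitates the failure mode; it is not `generate_summary`.

```python
def summarize_all(prompts, summarizer, placeholder="no data"):
    """Summarize every prompt, recording the positions where the summarizer failed."""
    summaries, failed = [], []
    for position, prompt in enumerate(prompts):
        try:
            summaries.append(summarizer(prompt))
        except ValueError:
            summaries.append(placeholder)
            failed.append(position)
    return summaries, failed


def toy_summarizer(prompt):
    """Stand-in summarizer: raises ValueError when no alphanumeric token remains."""
    tokens = [word.lower() for word in prompt.split() if word.isalnum()]
    if not tokens:
        raise ValueError("empty vocabulary")
    return " ".join(tokens)


toy_prompts = ["give me an example", "???", "continue"]
toy_summaries, failed_positions = summarize_all(toy_prompts, toy_summarizer)
print(toy_summaries)
print(failed_positions)
```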


### Checking code or no-code solution based on keywords present

In [87]:
final_df['Found_Keywords'] = final_df['Summary'].apply(lambda x: [kw for kw in keywords if kw in x])
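
Here `x` is the summary string, so `kw in x` is a substring test: `"script"` is found inside `"typescript"`. Earlier cells compared against a list of tokens, which only matches whole words. The difference can be sketched as follows; both function names are hypothetical and the sample summary is a shortened example.

```python
import re


def substring_matches(summary, keywords):
    """Mirror of the lambda above: a keyword counts if it occurs anywhere in the string."""
    return [kw for kw in keywords if kw in summary]


def whole_word_matches(summary, keywords):
    """A keyword counts only when it is not glued to other word characters."""
    text = summary.lower()
    return [kw for kw in keywords
            if re.search(r"(?<!\w)" + re.escape(kw.lower()) + r"(?!\w)", text)]


sample_summary = "wish typescript could mark function throws"
sample_keywords = ["script", "function", "error"]

print(substring_matches(sample_summary, sample_keywords))
print(whole_word_matches(sample_summary, sample_keywords))
```

Which behaviour is wanted depends on the goal: substring matching also catches inflected forms such as "optimized" for "optimize", at the cost of false hits like the one above.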

In [94]:
final_df['Solution_Type'] = final_df.apply(lambda row: 'no-code' if not row['ListOfCode'] else 'code', axis=1)

In [99]:
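def determine_solution_type(row):
    # 'code' requires both a keyword hit in the summary and a keyword somewhere in the
    # stringified ListOfCode. The ReplaceString values shown earlier ('[CODE_BLOCK_0]', ...)
    # contain "code" once lowercased, so the second test passes for those entries.
    # Rows that include code blocks but have no Found_Keywords are therefore labelled 'no-code'.
    if row['Found_Keywords'] and any(keyword in ' '.join(map(str, row['ListOfCode'])).lower() 
                                     for keyword in keywords):
        return 'code'
    else:
        return 'no-code'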

final_df['Solution_Type'] = final_df.apply(determine_solution_type, axis=1)

In [98]:
final_df

Unnamed: 0,Prompt,Answer,ListOfCode,Summary,Found_Keywords,Solution_Type
0,You are an expert search query generator.\n\nI...,"ChatGPTCertainly, here are 10 high-quality sea...","[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",generate high quality search queries based pro...,[generate],code
1,https://github.com/arakoodev/EdgeChains,ChatGPTFinished browsingShow workHide workEdge...,[],https,[],no-code
2,can this be used from python:\nhttps://github....,"ChatGPTYes, EdgeChains can be used from Python...",[],used python https,[],no-code
3,give me an example,ChatGPTI do not have an example. You may find ...,[],give example,[example],no-code
4,Write me python3 script that takes in mp3 audi...,ChatGPTCreating a beautiful audio visualizer v...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",write python3 script takes mp3 audio track gen...,"[write, script, generate]",code
...,...,...,...,...,...,...
1486,continue,ChatGPTUsed WebPilotThe version of the FSRS tu...,[],continue,[],no-code
1487,proceed,ChatGPTUsed WebPilotThe version of the FSRS tu...,[],proceed,[],no-code
1488,I wish that in typescript I could mark a funct...,ChatGPTAs of my knowledge cutoff in September ...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",wish typescript could mark function throws cal...,"[script, function, error, exception]",code
1489,Are there no other options that use static ana...,ChatGPTAs of my knowledge cut-off in September...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",someting like adding comment top typescript fu...,"[script, function]",code
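
To answer the "Code or No Code?" question for the whole dataset, the per-row labels still need to be tallied. The sketch below counts rows by whether `ListOfCode` is empty and also counts the `Type` of each code block; the function name and the toy column are hypothetical, and the dictionary keys follow the `ListOfCode` entries printed earlier.

```python
from collections import Counter


def tally_solution_types(list_of_code_column):
    """Count code vs. no-code answers and the languages of the returned code blocks."""
    solution_counts = Counter()
    language_counts = Counter()
    for blocks in list_of_code_column:
        solution_counts["code" if blocks else "no-code"] += 1
        for block in blocks or []:
            language_counts[block.get("Type")] += 1
    return solution_counts, language_counts


# Toy column (hypothetical) with one answer containing two code blocks
toy_column = [
    [{"ReplaceString": "[CODE_BLOCK_0]", "Type": "bash", "Content": "pip install moviepy\n"},
     {"ReplaceString": "[CODE_BLOCK_1]", "Type": "python", "Content": "import numpy as np\n"}],
    [],
    [],
]

solution_counts, language_counts = tally_solution_types(toy_column)
print(dict(solution_counts))
print(dict(language_counts))
```

With the notebook's data, the same function could be called on `final_df["ListOfCode"]`, assuming every entry in that column is a list.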


### EVERYTHING TOGETHER

#### Summary functions

In [100]:
def read_and_preprocess_text(text):
    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text)

    # Tokenize each sentence into words, remove stopwords and punctuation, and convert to lowercase
    stop_words = set(nltk.corpus.stopwords.words('english') + list(string.punctuation))
    tokenized_sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence) if word.isalnum() and word.lower() not in stop_words]
        for sentence in sentences
    ]

    return tokenized_sentences

def calculate_cosine_similarity(tfidf_matrix):
    # Calculate cosine similarity between sentences based on TF-IDF matrix
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return similarity_matrix

def generate_summary(text, num_sentences=3):
    # Preprocess the text
    tokenized_sentences = read_and_preprocess_text(text)

    # Convert sentences to TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([' '.join(sentence) for sentence in tokenized_sentences])

    # Calculate cosine similarity matrix
    similarity_matrix = calculate_cosine_similarity(tfidf_matrix)

    # Use the PageRank algorithm to rank sentences
    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)

    # Sort sentences by their scores and select the top ones
    ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(tokenized_sentences)), reverse=True)
    summary_sentences = [sentence for score, sentence in ranked_sentences[:num_sentences]]

    # Join the summary sentences to form the final summary
    summary = ' '.join([' '.join(sentence) for sentence in summary_sentences])

    return summary

In [None]:
data = 