# Linguistic analysis for different age groups

### High-level steps:
1. Analyzing the raw data, both manually and with WordCloud
2. Preprocessing the data (removing records with age under 18)
3. Feature engineering using syntactic parsing
4. Comparing syntactic features against GPT-4 output text generated through the OpenAI API
5. Predicting age_group with multiple modeling techniques, to understand how much of the data is stochastic and how much follows recognizable patterns

## OpenAI API with the GPT-4 model: Comparative text generation and analysis

#### GPT-4 text generation

In [None]:
import os
from openai import OpenAI
import pandas as pd
import random

# Set up your OpenAI API key
openai_api_key = os.getenv('OPENAI_API_KEY')
if openai_api_key is None:
    raise ValueError("OPENAI_API_KEY environment variable is not set.")
client = OpenAI(api_key=openai_api_key)


# Define the age group function.
# Any age outside 18-37 maps to 'Old'; ages under 18 are removed during preprocessing.
def define_age_group(age):
    if 18 <= age <= 27:
        return 'Young'
    elif 28 <= age <= 37:
        return 'Middle Aged'
    else:
        return 'Old'

# Function to generate text using OpenAI API
def generate_text(age_group, topic):
    # Define the prompt based on the requested age group and topic
    prompt = f"Sample text of max 20 words if you were of age '{age_group}' on Topic: {topic}."
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": prompt},
        ], 
        temperature=0.5,
        max_tokens=150
    )
    return completion.choices[0].message.content.strip()

# Generate a DataFrame with 'text', 'age', 'age_group', and 'topic' columns
def generate_data(num_samples):
    data = {'text': [], 'age': [], 'age_group': [], 'topic': []}

    # Candidate topics, defined once instead of on every loop iteration
    topics = ['InvestmentBanking', 'indUnk', 'Non-Profit', 'Banking',
              'Education', 'Engineering', 'Communications-Media',
              'BusinessServices', 'Internet', 'Museums-Libraries', 'Accounting',
              'Science', 'Student', 'Technology', 'Arts', 'Law', 'Consulting',
              'Automotive', 'Religion', 'Fashion', 'Sports-Recreation',
              'Publishing', 'Marketing', 'LawEnforcement-Security',
              'HumanResources', 'Telecommunications', 'Military', 'Government',
              'Transportation', 'Architecture', 'Advertising', 'Biotech',
              'RealEstate', 'Manufacturing', 'Construction', 'Chemicals',
              'Maritime', 'Agriculture', 'Tourism', 'Environment']

    for _ in range(num_samples):
        # Generate a random age (you can replace this with your own age generation logic)
        age = random.randint(18, 60)

        # Define age group using the provided function
        age_group = define_age_group(age)

        # Pick a random topic (you can replace this with your own topic generation logic)
        topic = random.choice(topics)
        
        # Use the OpenAI API to generate text based on the age group and topic
        generated_text = generate_text(age_group, topic)
        
        # Append data to the dictionary
        data['text'].append(generated_text)
        data['age'].append(age)
        data['age_group'].append(age_group)
        data['topic'].append(topic)

    # Create a DataFrame
    df = pd.DataFrame(data)
    return df

# Generate 1000 samples
num_samples = 1000
generated_data_1000 = generate_data(num_samples)
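
The cell above makes 1000 sequential API calls, so a single transient failure (rate limit, network error) would discard everything generated so far. A minimal sketch of a retry wrapper with exponential backoff is shown below. `call_with_retries` and its parameters are hypothetical names, not part of the notebook, and the broad `except Exception` is an assumption made to keep the sketch library-independent; in practice it would likely be narrowed to the client's transient error types.

```python
import random
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry with exponential backoff if it raises.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            # After the last attempt, surface the original error.
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

Inside `generate_data`, the call could then be wrapped as `call_with_retries(lambda: generate_text(age_group, topic))`, leaving the rest of the loop unchanged.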

In [163]:
generated_data_1000.drop('age', axis=1)

| | text | age_group | topic |
|---|---|---|---|
| 0 | The architectural grandeur of ancient structur... | Old | Architecture |
| 1 | "Over the years, the publishing industry has e... | Old | Publishing |
| 2 | Fashion isn't just about trends, it's about ex... | Old | Fashion |
| 3 | "I've always been fascinated by maritime histo... | Middle Aged | Maritime |
| 4 | "Expanding my business services has significan... | Middle Aged | BusinessServices |
| ... | ... | ... | ... |
| 995 | "Maritime traditions have greatly evolved sinc... | Old | Maritime |
| 996 | In my many years of experience, I've found tha... | Old | Accounting |
| 997 | "Advertising has drastically evolved with the ... | Middle Aged | Advertising |
| 998 | "I thoroughly enjoy the tranquility and wealth... | Middle Aged | Museums-Libraries |
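
Most rows in the preview are labeled Old, which follows from the sampling scheme: `random.randint(18, 60)` draws ages uniformly, but the Old group covers many more ages than the other two. The sketch below enumerates the age range to get the expected share of each group. It restates `define_age_group` with the same thresholds so that it runs on its own; the variable names are illustrative.

```python
from collections import Counter


def define_age_group(age):
    # Same thresholds as the notebook's define_age_group.
    if 18 <= age <= 27:
        return 'Young'
    elif 28 <= age <= 37:
        return 'Middle Aged'
    return 'Old'


# random.randint(18, 60) picks each integer age in [18, 60] with equal probability,
# so the expected share of a group is the fraction of ages that map to it.
ages = range(18, 61)
counts = Counter(define_age_group(age) for age in ages)
expected_share = {group: n / len(ages) for group, n in counts.items()}

for group, share in expected_share.items():
    print(f"{group}: {counts[group]} of {len(ages)} ages ({share:.1%})")
```

Under this scheme roughly 23% of samples are expected to be Young, 23% Middle Aged, and 53% Old, which is worth keeping in mind when comparing groups or training a classifier on the generated text.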


In [166]:
generated_data_1000.to_csv('new_text_data_1000.csv', index=False)
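
The prompt asks for at most 20 words, but `max_tokens=150` leaves room for longer replies, so the model is not guaranteed to respect the limit. A rough check, assuming whitespace-separated tokens are an acceptable proxy for words, might look like the sketch below. `count_words` and `texts_over_limit` are hypothetical helpers, and the sample strings are toy data standing in for the generated column.

```python
def count_words(text):
    """Count whitespace-separated tokens; a rough proxy for word count."""
    return len(text.split())


def texts_over_limit(texts, limit=20):
    """Return (index, word_count) pairs for texts longer than `limit` words."""
    return [(i, count_words(t)) for i, t in enumerate(texts) if count_words(t) > limit]


# Toy strings standing in for generated_data_1000['text'].tolist()
sample_texts = [
    "Short sample about banking.",
    " ".join(["word"] * 25),
]
print(texts_over_limit(sample_texts))
```

On the real data, the same function could be applied to `generated_data_1000['text'].tolist()` to see how many samples exceed the requested length before comparing syntactic features.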

In [200]:
print(common_columns)

['Young', 'Middle Aged', 'Old']
