# Hindi-English Code-Switching Model Demo

This notebook demonstrates how to use a Hindi-English code-switching model fine-tuned from XLM-RoBERTa. The model is hosted on the Hugging Face Hub and can be loaded directly, with no training required.

## Setup

First, let's install the required packages if you haven't already:

In [1]:
# Run this cell to install the required packages
%pip install -r requirements.txt


## Loading the Model

Now let's load the model and tokenizer from the Hugging Face Hub:

In [2]:
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("lord-rajkumar/Code-Switch-Model")

# Create a fill-mask pipeline
# Note: the pipeline selects the device automatically (GPU if available) and
# prints a message such as "Device set to use mps:0" or "Device set to use cuda:0"
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Create zero-shot classification pipeline for demographic analysis
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

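def classify_demographics(token_str):
    """Guess age-group and region labels for a token via zero-shot classification."""
    token_str_clean = token_str.strip()
    if not token_str_clean:
        # Return every key the callers read, so an empty token cannot raise KeyError
        return {"age": "unknown", "age_confidence": "n/a",
                "region": "unknown", "region_confidence": "n/a"}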
    
    # Classify for age
    result_age = classifier(token_str_clean, candidate_labels=["under 30", "over 30"])
    age_label = result_age["labels"][0]
    age_score = result_age["scores"][0]
    
    # Classify for region
    result_region = classifier(token_str_clean, candidate_labels=["urban", "rural"])
    region_label = result_region["labels"][0]
    region_score = result_region["scores"][0]
    
    return {
        "age": age_label,
        "age_confidence": f"{age_score:.2f}",
        "region": region_label,
        "region_confidence": f"{region_score:.2f}"
    }

Device set to use mps:0
Device set to use mps:0
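
The loops in this notebook treat the fill-mask result as a flat list of predictions, which holds when a sentence contains exactly one mask. To my understanding, the pipeline returns one list per mask when a sentence contains several, so a small guard can be useful before calling `fill_mask`. The sketch below is only an illustration: `count_masks` and `split_by_mask_count` are hypothetical helpers, and the last two sentences are made-up inputs.

```python
MASK_TOKEN = "<mask>"  # mask token used by XLM-RoBERTa tokenizers


def count_masks(sentence, mask_token=MASK_TOKEN):
    """Return how many times the mask token occurs in the sentence."""
    return sentence.count(mask_token)


def split_by_mask_count(sentences, mask_token=MASK_TOKEN):
    """Separate sentences with exactly one mask from all the others."""
    single, other = [], []
    for sentence in sentences:
        if count_masks(sentence, mask_token) == 1:
            single.append(sentence)
        else:
            other.append(sentence)
    return single, other


candidates = [
    "<mask>, kya scene hai?",
    "Aaj <mask> plans kya hain?",
    "<mask> pe <mask> progress chal raha hai.",  # hypothetical two-mask input
    "Meeting mein discussion hui.",              # hypothetical input with no mask
]

ok, rejected = split_by_mask_count(candidates)
print("Safe to pass to fill_mask:", ok)
print("Needs editing first:", rejected)
```

In the notebook itself, `tokenizer.mask_token` could replace the hard-coded constant so the check follows whichever tokenizer is loaded.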


## Testing with Example Sentences

Let's test the model with various code-switched sentences:

In [3]:
# Define example sentences
examples = [
    "<mask>, kya scene hai?",   # Translation: <mask>, what's up?
    "Project pe <mask> progress chal raha hai.", # Translation: <mask> progress is being made on the project.
    "Hello, <mask> kya kr raha hai?"    # Translation: Hello, <mask>, what are you doing?
]

# Process each example
for example in examples:
    print(f"\n=== Input: {example} ===")
    results = fill_mask(example)
    for result in results:
        token = result['token_str']
        score = result['score']
        print(f"\nToken: '{token}', Score: {score:.4f}")
        
        # Perform demographic classification
        demographics = classify_demographics(token)
        print(f"  Demographics: Age likely {demographics['age']} (confidence: {demographics['age_confidence']})")
        print(f"               Region likely {demographics['region']} (confidence: {demographics['region_confidence']})")


=== Input: <mask>, kya scene hai? ===

Token: 'Bhai', Score: 0.1594
  Demographics: Age likely under 30 (confidence: 0.66)
               Region likely rural (confidence: 0.57)

Token: 'Hello', Score: 0.1397
  Demographics: Age likely under 30 (confidence: 0.75)
               Region likely urban (confidence: 0.56)

Token: 'Hi', Score: 0.1270
  Demographics: Age likely under 30 (confidence: 0.67)
               Region likely urban (confidence: 0.59)

Token: 'Sir', Score: 0.0762
  Demographics: Age likely over 30 (confidence: 0.60)
               Region likely urban (confidence: 0.55)

Token: 'Hai', Score: 0.0436
  Demographics: Age likely under 30 (confidence: 0.69)
               Region likely urban (confidence: 0.57)

=== Input: Project pe <mask> progress chal raha hai. ===

Token: 'kya', Score: 0.2187
  Demographics: Age likely under 30 (confidence: 0.72)
               Region likely urban (confidence: 0.62)

Token: 'bahut', Score: 0.1086
  Demographics: Age likely under 30 (confid

## Expected Output

When you run the code above, you should see results similar to these (including demographic analysis):

```
=== Input: <mask>, kya scene hai? ===

Token: 'Bhai', Score: 0.1594
  Demographics: Age likely under 30 (confidence: 0.66)
               Region likely rural (confidence: 0.57)

Token: 'Hello', Score: 0.1397
  Demographics: Age likely under 30 (confidence: 0.75)
               Region likely urban (confidence: 0.56)

Token: 'Hi', Score: 0.1270
  Demographics: Age likely under 30 (confidence: 0.67)
               Region likely urban (confidence: 0.59)

Token: 'Sir', Score: 0.0762
  Demographics: Age likely over 30 (confidence: 0.60)
               Region likely urban (confidence: 0.55)

Token: 'Hai', Score: 0.0436
  Demographics: Age likely under 30 (confidence: 0.69)
               Region likely urban (confidence: 0.57)
```
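
A note on reading the confidence values: as far as I understand the zero-shot pipeline, each candidate label is turned into an NLI hypothesis (the default template is assumed here to be "This example is {}.") and, in the default single-label mode, the entailment logits are normalized with a softmax across the candidate labels. With two labels the scores therefore sum to 1, and 0.50 means no preference. The sketch below reproduces only that normalization step; the logits are made-up numbers, not output from the classifier.

```python
import math

HYPOTHESIS_TEMPLATE = "This example is {}."  # assumed default template


def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Turn candidate labels into NLI hypothesis sentences."""
    return [template.format(label) for label in labels]


def softmax(logits):
    """Normalize logits into scores that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["under 30", "over 30"]
entailment_logits = [0.4, -0.3]  # hypothetical values, for illustration only

hypotheses = build_hypotheses(labels)
scores = softmax(entailment_logits)

for hypothesis, label, score in zip(hypotheses, labels, scores):
    print(f"{hypothesis!r} -> {label}: {score:.2f}")
print(f"Sum of scores: {sum(scores):.2f}")
```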

## Analysis of Demographic Patterns

The zero-shot classifier's labels for the predicted tokens show a few patterns:

1. **Age patterns**:
   - Most predicted tokens are labeled "under 30", which aligns with the prevalence of code-switching among younger generations
   - Formal terms like "Sir" are labeled "over 30", suggesting the classifier associates formality with older age groups

2. **Regional patterns**:
   - English greetings like "Hello" and "Hi" are labeled urban
   - "Bhai" is labeled rural, unlike the English greetings
   - Personal names like "Rahul" receive a very high urban confidence

3. **Confidence levels**:
   - The classifier's confidence is generally higher for age than for region
   - Most classifications have moderate confidence (0.55-0.75); with only two candidate labels, 0.50 means no preference, so these are weak signals rather than firm labels

These patterns suggest that code-switching has demographic dimensions that NLP techniques can help surface, although labels assigned to single tokens should be read as tentative.
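
The observations above can be checked against the numbers printed for the first example. The sketch below hard-codes the five recorded predictions for `<mask>, kya scene hai?` and tallies labels and mean confidences; it covers only that one sentence, so it illustrates how to do the count rather than summarizing all outputs.

```python
from collections import Counter
from statistics import mean

# (token, age label, age confidence, region label, region confidence),
# copied from the recorded output for "<mask>, kya scene hai?"
recorded = [
    ("Bhai", "under 30", 0.66, "rural", 0.57),
    ("Hello", "under 30", 0.75, "urban", 0.56),
    ("Hi", "under 30", 0.67, "urban", 0.59),
    ("Sir", "over 30", 0.60, "urban", 0.55),
    ("Hai", "under 30", 0.69, "urban", 0.57),
]

# Count how often each label was chosen
age_counts = Counter(row[1] for row in recorded)
region_counts = Counter(row[3] for row in recorded)

# Average confidence of the chosen label, per dimension
mean_age_conf = mean(row[2] for row in recorded)
mean_region_conf = mean(row[4] for row in recorded)

print("Age labels:", dict(age_counts))
print("Region labels:", dict(region_counts))
print(f"Mean age confidence: {mean_age_conf:.3f}")
print(f"Mean region confidence: {mean_region_conf:.3f}")
```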

In [4]:
# Try your own examples here with demographic analysis
custom_examples = [
    "<mask> working on this project?",
    "Aaj <mask> plans kya hain?",
    "Meeting mein <mask> discussion hui."
]

for example in custom_examples:
    print(f"\n=== Input: {example} ===")
    results = fill_mask(example)
    for result in results:
        token = result['token_str']
        score = result['score']
        print(f"\nToken: '{token}', Score: {score:.4f}")
        
        # Perform demographic classification
        demographics = classify_demographics(token)
        print(f"  Demographics: Age likely {demographics['age']} (confidence: {demographics['age_confidence']})")
        print(f"               Region likely {demographics['region']} (confidence: {demographics['region_confidence']})")


=== Input: <mask> working on this project? ===

Token: 'Help', Score: 0.0749
  Demographics: Age likely under 30 (confidence: 0.69)
               Region likely urban (confidence: 0.52)

Token: 'Still', Score: 0.0742
  Demographics: Age likely under 30 (confidence: 0.69)
               Region likely rural (confidence: 0.50)

Token: 'Like', Score: 0.0719
  Demographics: Age likely under 30 (confidence: 0.66)
               Region likely urban (confidence: 0.56)

Token: 'Any', Score: 0.0620
  Demographics: Age likely under 30 (confidence: 0.74)
               Region likely urban (confidence: 0.52)

Token: 'Your', Score: 0.0570
  Demographics: Age likely under 30 (confidence: 0.61)
               Region likely urban (confidence: 0.53)

=== Input: Aaj <mask> plans kya hain? ===

Token: 'ka', Score: 0.2887
  Demographics: Age likely under 30 (confidence: 0.64)
               Region likely urban (confidence: 0.55)

Token: 'ke', Score: 0.2329
  Demographics: Age likely under 30 (confidence: 

## Conclusion

This notebook demonstrated the fine-tuned XLM-RoBERTa model on code-switched input: given a masked position in a mixed-language sentence, it predicts plausible Hindi or English words.

Whether the model completes a sentence with a Hindi or an English word depends on the surrounding context, which mirrors the natural code-switching behavior observed in multilingual Indian communities.