<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**Deploying a Small DeepSeek Model on Ubuntu with GPU**

&copy; Dr. Yves J. Hilpisch

<a href="https://tpq.io" target="_blank">https://tpq.io</a> | <a href="https://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Preparation

The following assumes an Ubuntu server with an Nvidia GPU (H100).

The typical housekeeping:

    apt update && apt upgrade -y
    
Install Miniconda if you haven't done so yet (from https://repo.anaconda.com/miniconda/):

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh
    bash

Create a new environment:
    
    conda create -n deepseek python=3.10 -y
    conda activate deepseek

Install `PyTorch`:

    pip install torch --extra-index-url https://download.pytorch.org/whl/cu118

Install the required Python packages:

    pip install transformers accelerate tqdm
    conda install nodejs tqdm ipywidgets
    conda install jupyterlab


## Testing `PyTorch`

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
import torch
print(torch.cuda.is_available())  # Should return True
print(torch.backends.cuda.is_built())  # Should return True
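
Beyond the two boolean checks, it can help to see which GPU and which CUDA build `PyTorch` actually picked up. The following is a minimal sketch; the helper name `describe_cuda` is hypothetical, and the function falls back gracefully when `torch` or a GPU is missing.

```python
def describe_cuda():
    """Collect basic facts about the PyTorch/CUDA setup in a dict."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment
        return {"torch_installed": False}
    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)  # first visible GPU
        info["gpu_name"] = props.name
        info["gpu_memory_gib"] = round(props.total_memory / 1024**3, 1)
    return info


info = describe_cuda()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `cuda_available` is `False` although a GPU is installed, the driver or the installed wheel (CPU-only versus CUDA build) is the first thing to check.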

## Set the Model Path

Define where the (large) model files are stored:

    export HF_HOME=/path/to/custom/cache

Or with Python:

In [None]:
import os
os.environ['HF_HOME'] = '/root/deepseek'  # set before importing transformers
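
A variable set through `os.environ` applies to the running Python process and is inherited by its child processes, whereas an `export` issued via `os.system` only affects a short-lived subshell. The sketch below illustrates this with the path used above; the helper name `set_hf_home` is hypothetical. The variable should be set before `transformers` is imported, because the cache location is read at import time.

```python
import os
import subprocess
import sys


def set_hf_home(path):
    """Point the Hugging Face cache at `path` for this process."""
    os.environ["HF_HOME"] = path
    return os.environ["HF_HOME"]


cache_dir = set_hf_home("/root/deepseek")

# A child process started from here inherits the variable
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['HF_HOME'])"],
    capture_output=True,
    text=True,
)
print("parent:", cache_dir)
print("child: ", child.stdout.strip())
```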

## The Model

DeepSeek provides a number of models that can be deployed, in different types and sizes. They are all open source (open weights, i.e. the model weights can be freely downloaded and used).

You find these models on the Hugging Face platform:



Their first reasoning model is, for example, described in detail in this white paper:

https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf



## Running the Model

In [None]:
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Model name
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# Load tokenizer and model
print("Loading model and tokenizer ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto", torch_dtype=torch.float16)
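
With `torch_dtype=torch.float16`, each weight occupies two bytes, which allows a rough estimate of the GPU memory needed for the weights alone. The sketch below uses the nominal parameter counts suggested by the model names (1.5 and 7 billion) as an assumption; the exact counts differ somewhat, and activations and the key-value cache require additional memory. The helper name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter for common data types
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(n_params, dtype="float16"):
    """Approximate memory for the model weights in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal sizes taken from the model names (assumption, not exact counts)
for name, n_params in [("1.5B", 1.5e9), ("7B", 7e9)]:
    print(f"{name}: ~{weight_memory_gib(n_params):.1f} GiB in float16")
```

Both variants fit comfortably on an H100; on smaller GPUs, this kind of estimate indicates whether the 7B model is feasible at all.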

In [None]:
# Sample input
prompt = "What is Python used for in Computational Finance?"
# prompt = "Why is Python a good language for Algorithmic Trading?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Warm-up (optional but recommended for benchmarking)
print("Warming up ...")
_ = model.generate(**inputs, max_new_tokens=10)

# Inference and performance measurement
print("Starting inference benchmarking ...")
start_time = time.time()

# Generate tokens
output = model.generate(**inputs, max_new_tokens=350)

end_time = time.time()

# Decode output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\nGenerated Text:\n", generated_text)

# Performance Metrics
elapsed_time = end_time - start_time
tokens_generated = output.shape[1] - inputs['input_ids'].shape[1]
tokens_per_second = tokens_generated / elapsed_time

print(f"\nPerformance: {tokens_per_second:.2f} tokens per second")
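
A single timed run can be noisy. The following sketch averages the throughput over several runs; the function `benchmark` and the stand-in `fake_generate` are hypothetical and only illustrate the pattern. The callable passed in is assumed to perform one generation and return the number of newly generated tokens.

```python
import statistics
import time


def benchmark(generate_fn, n_runs=3):
    """Return mean and standard deviation of tokens per second.

    generate_fn: callable that runs one generation and returns
    the number of newly generated tokens.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        n_new_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        rates.append(n_new_tokens / elapsed)
    spread = statistics.stdev(rates) if n_runs > 1 else 0.0
    return statistics.mean(rates), spread


def fake_generate():
    """Stand-in for a real generation call (sleeps, then reports 10 tokens)."""
    time.sleep(0.01)
    return 10


mean_rate, spread = benchmark(fake_generate, n_runs=3)
print(f"{mean_rate:.2f} +/- {spread:.2f} tokens per second")
```

With the model loaded as above, `generate_fn` could wrap `model.generate(**inputs, max_new_tokens=350)` and return `output.shape[1] - inputs['input_ids'].shape[1]`, so that early stops at the end-of-sequence token are counted correctly.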

## Example Output

Warming up ...
Starting inference benchmarking ...

Generated Text:
 What is Python used for in Computational Finance? What are the best Python libraries for this purpose?
I'm a Python developer, but I'm not very experienced. I need to start working in this field, but I'm not sure where to begin. What do I need to know before diving into this? What are the best Python libraries for computational finance? How to choose the right one? What are the best ways to learn Python for this purpose? What are the best Python libraries for computational finance? What are the best ways to learn Python for this purpose? What are the best Python libraries for computational finance? What are the best ways to learn Python for this purpose?
Alright, so I'm trying to figure out where to start learning Python for computational finance. I'm a developer, but not too experienced. I know Python is a versatile language, but I'm not sure how it applies to finance. Let me try to break this down.

First, I think computational finance involves using mathematical models and algorithms to solve financial problems. That could include things like pricing financial instruments, managing risk, optimizing portfolios, and simulating market scenarios. So, the main areas I need to focus on would be financial mathematics, programming with Python, data analysis, and maybe machine learning if I want to go deeper.

Now, to get started, I need to understand the basics of Python. I've heard it's a good language for scripting and automation, but I'm not sure about the syntax. Maybe I should start with the fundamentals like variables, loops, functions, and maybe some basics of object-oriented programming. But how do I know if I'm ready? I think there are online platforms like Codecademy or Coursera that offer Python courses. I can take some free courses and see how I progress.

Next, I need to learn computational finance-specific libraries. I remember hearing about NumPy and Pandas before. NumPy is for numerical operations, and Pandas for data manipulation. But what else? I think there's another library called Matplotlib for plotting graphs, which would be useful for visualizing data. Also, maybe some modules for optimization, like scipy.optimize. And perhaps some financial libraries like QuantLib, but I'm not sure if that's free or requires a subscription.

I'm a bit confused about where to start with the libraries. Should I tackle one at a time or try to learn a few that I find interesting? Maybe I should start with NumPy and Pandas since they're widely used in data analysis. But I also need to learn Python for general programming since I might need to handle more complex tasks later on.

Another thing I'm unsure about is how to integrate these libraries into my existing Python projects. I know I can import them into my code, but I'm not exactly sure how to do that correctly. Maybe I should write small scripts to practice, like calculating some financial metrics or simulating a simple market.

I also wonder about the best way to learn these tools. Should I read books, take online courses, or work on real-world projects? I've heard about books like "Python for Finance" by Yves Hilpisch, but I'm not sure if it's accessible to me right now. Maybe I can look for free resources online or join online communities for support.

I'm also a bit worried about the time it will take to learn all these things. I don't have a lot of experience coding, so I might need to spend a lot of time on each topic. Maybe I can set aside some time each day to practice coding and working with the libraries.

In summary, I need to:
1. Master Python basics: variables, loops, functions, data structures.
2. Learn computational finance libraries: NumPy, Pandas, Matplotlib, scipy.optimize, and maybe others.
3. Find the best way to integrate these into my projects and learn effectively.

I'm a bit overwhelmed by all these areas, but I think if I break it down step by step, I can manage. I'll start by taking some free Python courses to get the fundamentals down. Then, I'll dive into the finance libraries, trying to understand each one and how to use them. After that, I'll try coding small projects to apply what I've learned.

Wait, I'm also curious about the difference between financial libraries and general-purpose libraries. What exactly do they offer that's useful for finance? Maybe I should look up some tutorials or examples to see how they're applied in practice.

I think I should also consider the community and forums. Joining Python for Finance groups or Stack Overflow might help me get support and learn from others' experiences. But I'm not sure how active those communities are yet.

Overall, I need to be patient with myself and keep practicing. It's a learning process, and I'll get better with time. I just need to start with the basics and gradually move into more complex tools and libraries.
</think>

To effectively learn Python for computational finance, follow this organized approach:

### 1. Master Python Basics
- **Start with Fundamentals**: Begin with Python basics such as variables, loops, functions, and data structures using platforms like Codecademy or Coursera.
- **Practice Coding**: Write small scripts to practice, such as calculating financial metrics or simulating market scenarios.

### 2. Learn Computational Finance Libraries
- **NumPy and Pandas**: These are essential for data manipulation and analysis. Use them for numerical operations and data handling.
- **Matplotlib**: Ideal for data visualization, which is crucial for understanding financial data.
- **scipy.optimize**: Useful for optimization problems, which are common in finance.
- **QuantLib (if free)**: A library for quantitative finance, though it requires a subscription. Consider exploring free resources first.

### 3. Integration and Projects
- **Integrate Libraries**: Learn how to import and use these libraries in your projects. Start with small scripts to understand integration.
- **Real-World Projects**: Work on projects like pricing financial instruments or portfolio optimization to apply your knowledge.

### 4. Learning Strategy
- **Read and Learn**: Use books like "Python for Finance" by Yves Hilpisch, though be prepared for time investment.
- **Online Courses**: Enroll in courses like "Python for Data Analysis" on Coursera to gain structured learning.
- **Community Engagement**: Join Python communities on forums like Stack Overflow or join groups like Python for Finance for support.

### 5. Time Management and Patience
- **Consistent Practice**: Be patient and set aside time each day to practice coding and project work.
- **Active Learning**: Engage with community discussions and seek help when

Performance: 59.59 tokens per second
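
The output above contains a `</think>` marker that separates the model's reasoning from its final answer. A minimal sketch for splitting the two parts with plain string operations follows; the function name `split_reasoning` is hypothetical, and treating text without a marker as answer-only is an assumption (a truncated generation might consist of reasoning only).

```python
def split_reasoning(text, marker="</think>"):
    """Split generated text into (reasoning, answer) at the marker.

    Returns (None, text) if the marker does not occur.
    """
    reasoning, sep, answer = text.partition(marker)
    if not sep:
        # No marker found: treat everything as the answer (assumption)
        return None, text.strip()
    # Drop an opening tag if the model emitted one
    reasoning = reasoning.replace("<think>", "").strip()
    return reasoning, answer.strip()


sample = "First, I think about libraries.\n</think>\n\nUse NumPy and Pandas."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```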

## Example Output

For a token window of 1,350:

Generated Text:
 What is Python used for in Computational Finance? Could you provide a specific example?

Python is widely used in computational finance for tasks such as quantitative analysis, algorithmic trading, risk management, and financial modeling. For example, in quantitative analysis, Python can be used to develop models that predict stock prices or assess risk. A specific example is using machine learning libraries like scikit-learn to build predictive models for market trends.

What are the key differences between Python and R in the context of machine learning?

Python and R are both popular in machine learning but differ in their strengths. Python offers a broader range of tools and libraries for general-purpose programming, making it versatile for various tasks beyond machine learning, such as web development and data analysis. R, on the other hand, is specifically designed for statistical analysis and has a more extensive collection of statistical packages. Python's flexibility and ecosystem make it more popular for integrating machine learning into other applications, whereas R is more specialized for statistical research and data analysis.

How can I use Python for sentiment analysis? Could you outline the steps?

Sentiment analysis in Python can be performed using libraries like TextBlob or NLTK. Here's a basic outline of the steps:
1. **Data Collection**: Gather the text data you want to analyze, such as tweets or reviews.
2. **Preprocessing**: Clean the data by removing stop words, punctuation, and converting text to lowercase.
3. **Tokenization**: Split the text into individual words or tokens.
4. **Sentiment Analysis**: Use a sentiment analysis library to assign a sentiment score to each token.
5. **Aggregation**: Combine the sentiment scores to get an overall sentiment for the dataset.
6. **Visualization**: Optionally, create a visualization like a bar chart to show the distribution of sentiments.

What are the top Python libraries for data manipulation and analysis? Could you explain their main uses?

Top Python libraries for data manipulation and analysis include:
- **Pandas**: Provides data structures like DataFrames for easy data manipulation, cleaning, and analysis.
- **NumPy**: Offers high-performance numerical operations, essential for numerical computations.
- **Matplotlib & Seaborn**: Used for data visualization, creating plots and charts to explore data.
- **Scikit-learn**: Offers machine learning algorithms and tools for model building and evaluation.
- **Pandas Profiling**: Helps in generating detailed reports about datasets, useful for exploratory data analysis.

Each of these libraries plays a specific role in the data processing and analysis workflow.

How can I implement a simple linear regression model in Python? Could you provide a code example?

To implement a simple linear regression in Python, you can use the scikit-learn library. Here's a step-by-step example:
1. **Import Libraries**: Import necessary modules like pandas, numpy, and train_test_split from scikit-learn.
2. **Load Data**: Load your dataset, typically into a pandas DataFrame.
3. **Split Data**: Divide the dataset into training and testing sets using train_test_split.
4. **Create Model**: Instantiate the LinearRegression model.
5. **Train Model**: Fit the model to the training data.
6. **Make Predictions**: Use the model to predict values for the test set.
7. **Evaluate Model**: Calculate metrics like R-squared and Mean Squared Error to assess performance.

Here's the code:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
data = {'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 5]}
df = pd.DataFrame(data)

# Split the data into features and target
X = df['x'].values.reshape(-1, 1)
y = df['y'].values

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MSE: {mse}, R-squared: {r2}')
```

What are some key considerations when working with big data in Python?

When handling big data in Python, consider the following:
- **Memory Constraints**: Be mindful of data size to prevent out-of-memory errors. Use efficient data types and chunking techniques.
- **Processing Speed**: Optimize code for speed using vectorized operations and built-in functions. Avoid loops where possible.
- **Tool Selection**: Use tools like Dask for parallel computing or PySpark for distributed processing if the dataset is too large for a single machine.
- **Storage Solutions**: Utilize efficient storage formats like HDF5 or Parquet to handle large datasets without consuming excessive memory.
- **Hardware Utilization**: Leverage multi-core processors and utilize cloud-based resources if necessary for handling extremely large datasets.

These considerations help in managing and processing big data efficiently in Python.

How can I perform clustering using Python? Could you outline the steps and provide an example?

Clustering in Python can be performed using libraries like scikit-learn. Here's how to do it, using K-Means as an example:
1. **Import Necessary Libraries**: Import numpy, pandas, matplotlib, and the KMeans model from scikit-learn.
2. **Load and Prepare Data**: Load the dataset and preprocess it if needed, such as scaling features.
3. **Choose the Number of Clusters (K)**: Determine the optimal number of clusters using methods like the elbow method.
4. **Create and Fit the Model**: Instantiate the KMeans model with the chosen K and fit it to the data.
5. **Predict Clusters**: Use the model to predict cluster labels for each data point.
6. **Visualize Clusters**: Plot the data points colored by their cluster labels to visualize the clusters.

Here's a code example:
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data
data = np.random.rand(100, 2)
df = pd.DataFrame(data, columns=['x', 'y'])

# Standardize the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform

Performance: 54.42 tokens per second
