SmallLanguageModel

This repository contains everything you need to build your own LLM from scratch. Inspired by Karpathy's nanoGPT and its Shakespeare generator, I made it to build my own LLM. It covers every stage, from data collection to the model architecture, tokenizer, and training files.

Repo Structure

This repo contains:

Data Collection: Directory containing a web scraper, in case you want to gather the data from scratch instead of downloading it.
Data Processing: Directory containing code to pre-process certain kinds of files, such as converting parquet files to .txt and .csv files, and code for appending files.
Models: Contains all the code needed to train a model of your own: a BERT model, a GPT model, and a Seq-2-Seq model, along with tokenizer and run files.
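
To illustrate the file-appending step mentioned under Data Processing, here is a minimal sketch that merges several .txt files into one training corpus. It is not the code shipped in this repo; the function name `append_text_files` and its parameters are assumptions made for the example, and it uses only the Python standard library.

```python
from pathlib import Path


def append_text_files(input_dir, output_file, pattern="*.txt"):
    """Concatenate every file matching `pattern` in `input_dir` into `output_file`.

    Hypothetical helper for illustration; returns the number of files merged.
    """
    input_dir = Path(input_dir)
    output_file = Path(output_file)
    # Sort for a deterministic order and skip the output file itself
    # in case it lives inside input_dir.
    sources = sorted(
        p for p in input_dir.glob(pattern)
        if p.resolve() != output_file.resolve()
    )
    with output_file.open("w", encoding="utf-8") as out:
        for src in sources:
            text = src.read_text(encoding="utf-8")
            out.write(text)
            # Keep consecutive files separated by a newline.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(sources)
```

For example, `append_text_files("Data", "Data/corpus.txt")` would merge the .txt files found in a `Data` folder into a single `corpus.txt`; the paths here are placeholders.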

Prerequisites

Before setting up SmallLanguageModel, ensure that you have the following prerequisites installed:

Python 3.8 or higher
pip (Python package installer)
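
If you are unsure whether your interpreter meets the version requirement, a quick check along these lines can help. This is only a sketch; the helper name `python_is_supported` is an assumption for the example and is not part of this repo.

```python
import sys

# Minimum version stated in the prerequisites above.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if python_is_supported():
        print("Python version OK:", sys.version.split()[0])
    else:
        print("Python 3.8 or higher is required.")
```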

How to use:

Follow these steps to train your own tokenizer or generate outputs from the trained model:

Clone this repository:

git clone https://github.com/shivendrra/SmallLanguageModel-project
cd SmallLanguageModel-project

Install Dependencies:
```
pip install -r requirements.txt
```
Train: Follow the instructions in training.md.

StarHistory

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

License

MIT License. See the LICENSE file for more information.

