<a href="https://colab.research.google.com/github/tiagochavo87/LCSH_analysis/blob/main/analysis_of_cma_lcshs_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [98]:
import pandas as pd

# Read the data from the CSV file
data = pd.read_csv("LOHx.csv", sep=";")

# Apply a filter to exclude chromosomes X and Y
data = data.loc[(data['Chrom'] != 'X') & (data['Chrom'] != 'Y')]

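# Remove thousands separators (dots) and convert the Tamanho column to integers.
# regex=False treats '.' as a literal dot instead of "any character".
data['Tamanho'] = data['Tamanho'].str.replace('.', '', regex=False).astype(int)

# Filter data for autosomes with a size greater than 10,000,000
# (a non-capturing group avoids the pandas "match groups" warning)
data_autosomes = data.loc[data['Chrom'].str.contains(r'^(?:[1-9]|[1-9][0-9])$')]
data_autosomes = data_autosomes[data_autosomes['Tamanho'] > 10000000]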

# Write the autosomes data to a new CSV file
data_autosomes.to_csv('LCSHs_>_10.csv', sep=';', index=False)

# Filter data for regions with size greater than or equal to 3,000,000
data = data[data['Tamanho'] >= 3000000]

# Filter data for regions with sizes between 3,000,000 and 5,000,000
data_3_5 = data[(data['Tamanho'] > 3000000) & (data['Tamanho'] <= 5000000)]

# Write the 3-5 MB regions data to a new CSV file
data_3_5.to_csv('LCSHs_>_3_<_5.csv', sep=';', index=False)

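# Keep files with exactly one 3-5 MB region whose total size is at least 10,000,000.
# Note: a single region of at most 5,000,000 cannot reach that total, so this
# filter always returns an empty DataFrame.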
data_3_5 = data_3_5.groupby('File').filter(lambda x: len(x) == 1 and x['Tamanho'].sum() >= 10000000)

# Filter data for regions with size greater than 5,000,000
data_5 = data[data['Tamanho'] > 5000000]

# Filter data for regions with a total size of at least 10,000,000 and only one chromosome per file
data_5 = data_5.groupby('File').filter(lambda x: len(x.groupby('Chrom')) == 1 and x['Tamanho'].sum() >= 10000000)

# Write the 5 MB regions data to a new CSV file
data_5.to_csv('possibles_UPDs.csv', sep=';', index=False)

# Filter data for cases with at least one LCSH greater than 5 MB
data_5_mb = data[data['Tamanho'] > 5000000]
data_5_mb = data_5_mb.groupby('File').filter(lambda x: x[x['Tamanho'] >= 5000000].shape[0] > 0)

# Print all cases with LCSHs greater than 5 MB to a CSV file
data_5_mb.to_csv('LCSHs_>_5.csv', sep=';', index=False)

# Sum LCSHs greater than 3 MB for each case
sum_lcshs = data[data['Tamanho'] >= 3000000].groupby('File')['Tamanho'].sum()

# Create a new DataFrame with case names and total LCSHs greater than 3 MB
output = pd.DataFrame({'File': sum_lcshs.index, 'Total LCSHs > 3MB': sum_lcshs.values})

# Write the output DataFrame to a CSV file
output.to_csv('total_lcshs.csv', sep=';', index=False)

# Total size of human autosomal DNA in base pairs (fixed reference value)
total_autosomal_dna = 2881000000

# Calculate the proportion of LCSHs greater than 3 MB for each case
proportions = []
for file, size in sum_lcshs.items():
    proportion = size / total_autosomal_dna
    proportions.append({'File': file, 'Total LCSHs > 3MB': size, 'Proportion of LCSHs > 3MB to Autosomal DNA': proportion})

# Create a new DataFrame with the proportion for each case
proportions_df = pd.DataFrame(proportions)

# Write the output DataFrame to a CSV file
proportions_df.to_csv('lcsh_proportions.csv', sep=';', index=False)




This code uses the NumPy and Pandas libraries to assign each case a reference inbreeding coefficient (F), based on the proportion of its autosomal DNA covered by LCSHs (long contiguous stretches of homozygosity).

The first step is to define an array of reference F values. These values are written as percentages (0.5, 1.5, 3, 6, 12.5 and 25) and then divided by 100.

Next, the code reads 'lcsh_proportions.csv', the file of per-case proportions, into a Pandas DataFrame.

Then, the code iterates through each row of the DataFrame and picks the reference F value that is closest to the proportion of LCSHs greater than 3 MB relative to autosomal DNA. This is done by applying NumPy's argmin() function to the absolute differences, which gives the index of the closest F value in the array.
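
The nearest-value lookup described above can also be done without a row loop. The following is a minimal sketch under that assumption, not the notebook's own implementation; the function name `closest_f` and the sample proportions are hypothetical.

```python
import numpy as np

# Reference inbreeding coefficients as fractions (0.5% ... 25%)
f_values = np.array([0.5, 1.5, 3, 6, 12.5, 25]) / 100

def closest_f(proportions):
    """Return, for each proportion, the reference F with the smallest absolute difference."""
    proportions = np.asarray(proportions, dtype=float)
    # Rows correspond to proportions, columns to reference F values
    distances = np.abs(proportions[:, None] - f_values[None, :])
    return f_values[distances.argmin(axis=1)]

# Hypothetical proportions, for illustration only
example = closest_f([0.004, 0.02, 0.10, 0.30])
print(example)
```

Because argmin() returns the first minimum and the reference values are sorted in ascending order, an exact tie between two reference values would resolve to the smaller F.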

After finding the closest F value for each row, the code builds a second Pandas DataFrame with one row per file and one column per reference F value. Each case's proportion is placed in the column of its closest F value.

Finally, this second DataFrame and the original proportions DataFrame are each written to a CSV file.

Overall, this code gives a coarse estimate of the inbreeding coefficient for each case from its LCSH proportion; it selects the nearest of six reference values rather than computing F exactly.

Output:
There are two output files: 'lcsh_transposed_proportions.csv' and 'inbreeding_coefficients.csv'. The first file contains the table with one column per reference F value, and the second file contains the original proportions DataFrame with an additional 'Inbreeding Coefficient (F)' column holding the closest F value.
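
The one-column-per-F layout can be sketched with a pivot, as shown below. This is an illustrative alternative, not the notebook's code; the case names, the short column names `proportion`, `F` and `F_label`, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical cases, for illustration only
df = pd.DataFrame({
    'File': ['case_a', 'case_b', 'case_c'],
    'proportion': [0.004, 0.028, 0.11],
    'F': [0.005, 0.03, 0.125],
})

# Label each reference F as a percentage string; ':g' drops trailing zeros
df['F_label'] = df['F'].map(lambda f: f'{f * 100:g}%')

# One row per case, one column per F label; cells hold the proportion
wide = df.pivot(index='File', columns='F_label', values='proportion')

# Fix the column order and keep F values that have no cases
labels = ['0.5%', '1.5%', '3%', '6%', '12.5%', '25%']
wide = wide.reindex(columns=labels)
print(wide)
```

Formatting the label with ':g' matters here: a fixed one-decimal format would produce '3.0%' rather than '3%', which would not match the intended column names.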






In [99]:
import numpy as np

# Define the F values
f_values = np.array([0.5, 1.5, 3, 6, 12.5, 25]) / 100

# Get the data proportions
proportions = pd.read_csv('lcsh_proportions.csv', sep=';')

# Calculate the F values closest to the proportions
for index, row in proportions.iterrows():
    proportion = row['Proportion of LCSHs > 3MB to Autosomal DNA']
    closest_f = f_values[np.abs(f_values - proportion).argmin()]
    proportions.at[index, 'Inbreeding Coefficient (F)'] = closest_f

# Transpose the cases to the columns corresponding to the F values
transposed_proportions = pd.DataFrame(columns=['File', '0.5%', '1.5%', '3%', '6%', '12.5%', '25%'])
for index, row in proportions.iterrows():
    file = row['File']
    f_value = row['Inbreeding Coefficient (F)']
    transposed_proportions.at[index, 'File'] = file
    # ':g' drops trailing zeros so the label matches the column names ('3%', not '3.0%')
    transposed_proportions.at[index, f'{f_value*100:g}%'] = row['Proportion of LCSHs > 3MB to Autosomal DNA']

# Write the transposed DataFrame to a new CSV file
transposed_proportions.to_csv('lcsh_transposed_proportions.csv', sep=';', index=False)

# Write the output DataFrame to a CSV file
proportions.to_csv('inbreeding_coefficients.csv', sep=';', index=False)


In [100]:
import re

import pandas as pd

# Load the CSV file
df = pd.read_csv("LCSHs_>_3_<_5.csv", sep=";")

# Extract the coordinates from the "Microarray Nome.." column:
# 'start' is the text between "(" and "-", 'end' is the text between "-" and ")"
df['start'] = df['Microarray Nome..'].str.extract(r'\((.*?)-', expand=False)
df['end'] = df['Microarray Nome..'].str.extract(r'-(.*?)\)', expand=False)

def extract_coordinates(row):
    """Parse "(start-end)" from a single string. Helper only; it is not called below."""
    pattern = r'\((\d+,?\d*)-(\d+,?\d*)\)'
    match = re.search(pattern, row)
    if match:
        start = match.group(1).replace(',', '.')
        end = match.group(2).replace(',', '.')
        return pd.Series([start, end])
    else:
        return pd.Series(['', ''])

# Drop the original column
df.drop('Microarray Nome..', axis=1, inplace=True)

# Create a new DataFrame with the columns "Chrom", "cytoband", "start" and "end"
new_df = pd.DataFrame({
    'Chrom': df['Chrom'],
    'cytoband': df['cytoband'],
    'start': df['start'],
    'end': df['end']
})

# Remove thousands separators (commas) from the coordinates
new_df['start'] = new_df['start'].replace(',', '', regex=True)
new_df['end'] = new_df['end'].replace(',', '', regex=True)

# Save the new DataFrame to a CSV file
new_df.to_csv('nova_dataframe_fron.csv', sep=";", index=False, decimal='.')
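
As an alternative sketch of the coordinate extraction in the cell above, both coordinates can be captured with a single pattern using named groups. The input strings below are hypothetical; they assume the column holds values of the form "(start-end)" with commas as thousands separators, which the actual file may not follow exactly.

```python
import pandas as pd

# Hypothetical nomenclature strings, for illustration only
names = pd.Series([
    'arr 1p33(49,149,495-53,138,197)',
    'arr 3p21.31(48,597,552-52,514,732)',
    'no coordinates here',
])

# Named groups become the column names of the extracted DataFrame
pattern = r'\((?P<start>[\d,]+)-(?P<end>[\d,]+)\)'
coords = names.str.extract(pattern)

# Strip thousands separators and convert to nullable integers
for col in ['start', 'end']:
    cleaned = coords[col].str.replace(',', '', regex=False)
    coords[col] = pd.to_numeric(cleaned).astype('Int64')
print(coords)
```

Rows that do not match the pattern end up as missing values instead of raising an error, so they can be inspected or dropped before grouping.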



In [101]:
import pandas as pd

# Read the CSV file
data = pd.read_csv('LOHx.csv', sep=';')

# Count the number of unique values in the "File" column
file_count = data['File'].nunique()

# Print the total number of cases ("File" values)
print(f"The total number of cases in the 'File' column is: {file_count}")

# Store the result in a DataFrame
result0 = pd.DataFrame([file_count])

# Save the result to a CSV file
result0.to_csv('output.csv', sep=';', index=False)

# Print the result
print(result0)


The total number of cases in the 'File' column is: 917
     0
0  917


In [102]:
import pandas as pd

# Read the input file (assumed to be in CSV format)
df = pd.read_csv('nova_dataframe_fron.csv', sep=";")
result = 917

# Group by chromosome and cytoband, including the chromosome number in the aggregation
grouped_df = df.groupby(['Chrom', 'cytoband']).agg({'start': 'median', 'end': 'median', 'Chrom': ['count', 'first']})

# Rename columns
grouped_df.columns = ['start', 'end', 'frequency', 'Chrom']

# Normalize the frequency against a reference value (the total number of cases)
grouped_df['normalized_frequency'] = grouped_df['frequency'] / result

# Save the result to a new CSV file
grouped_df.to_csv('resultado.csv', index=False)

# Keep cytobands whose absolute frequency is at least 0.5% of all regions
frequencia_absoluta = grouped_df[grouped_df['frequency'] >= 0.005 * len(df)]

# Save these cytobands to a new CSV file
frequencia_absoluta.to_csv('frequencia_absoluta.csv', sep=";", index=True, decimal='.')


In [110]:
import pandas as pd

# Read the input file (assumed to be in CSV format)
df = pd.read_csv('nova_dataframe_fron.csv', sep=";")
result = 917

# Group by chromosome and cytoband, including the chromosome number in the aggregation
grouped_df = df.groupby(['Chrom', 'cytoband']).agg({'start': 'median', 'end': 'median', 'Chrom': ['count', 'first']})

# Rename columns
grouped_df.columns = ['start', 'end', 'frequency', 'Chrom']

# Normalize the frequency against a reference value (the total number of cases)
grouped_df['normalized_frequency'] = grouped_df['frequency'] / result

# Format the 'start' and 'end' columns as zero-padded 9-digit strings
grouped_df[['start', 'end']] = grouped_df[['start', 'end']].astype(int).applymap('{:09d}'.format)


# Save the result to a new CSV file
grouped_df.to_csv('resultado.csv', index=False)

# Keep cytobands whose absolute frequency is at least 0.5% of all regions.
# .copy() makes an independent DataFrame and avoids SettingWithCopyWarning below.
frequencia_absoluta = grouped_df[grouped_df['frequency'] >= 0.005 * len(df)].copy()
# Round the normalized_frequency column to three decimals
frequencia_absoluta['normalized_frequency'] = frequencia_absoluta['normalized_frequency'].round(decimals=3)

# Save these cytobands to a new CSV file
frequencia_absoluta.to_csv('frequencia_absoluta.csv', sep=";", index=True, decimal='.')


In [115]:

import pandas as pd

# Read the input file (assumed to be in CSV format)
df = pd.read_csv('nova_dataframe_fron.csv', sep=";")
result = 917

# Group by chromosome and cytoband, including the chromosome number in the aggregation
grouped_df = df.groupby(['Chrom', 'cytoband']).agg({'start': 'median', 'end': 'median', 'Chrom': ['count', 'first']})

# Rename columns
grouped_df.columns = ['start', 'end', 'frequency', 'Chrom']

# Normalize the frequency against a reference value (the total number of cases)
grouped_df['normalized_frequency'] = grouped_df['frequency'] / result

# Format the 'start' and 'end' columns as zero-padded 9-digit strings
grouped_df[['start', 'end']] = grouped_df[['start', 'end']].astype(int).applymap('{:09d}'.format)

# Compute the region size in kilobases
grouped_df['Size'] = ((grouped_df['end'].astype(int) - grouped_df['start'].astype(int)) / 1000).astype(int)

# Save the result to a new CSV file
grouped_df.to_csv('resultado.csv', index=False)

# Keep cytobands whose absolute frequency is at least 0.5% of all regions.
# .copy() makes an independent DataFrame and avoids SettingWithCopyWarning below.
frequencia_absoluta = grouped_df[grouped_df['frequency'] >= 0.005 * len(df)].copy()

# Round the normalized_frequency column to three decimals
frequencia_absoluta['normalized_frequency'] = frequencia_absoluta['normalized_frequency'].round(decimals=3)

# Save these cytobands to a new CSV file
frequencia_absoluta.to_csv('frequencia_absoluta.csv', sep=";", index=True, decimal='.')


In [116]:
print(frequencia_absoluta)

                    start        end  frequency  Chrom  normalized_frequency  \
Chrom cytoband                                                                 
1     p33       049149495  053138197         82      1                 0.089   
      q21.1     144077594  148750533         52      1                 0.057   
      q21.2     147268778  150579272         56      1                 0.061   
2     q11.1     095550958  098905554         61      2                 0.067   
      q32.1     187305881  190929980         19      2                 0.021   
      q32.3     193761737  197779433         22      2                 0.024   
3     p21.31    048597552  052514732        120      3                 0.131   
4     p15.1     032356205  035680350         20      4                 0.022   
5     p12       043046912  046313469         21      5                 0.023   
      q23.3     128694241  132201418         45      5                 0.049   
6     p22.2     026290480  029543646    