larger raw file for testing adjust_rub() function #3

Closed · epogrebnyak opened this issue Jul 5, 2019 · 4 comments
Labels: enhancement (New feature or request), testing
Milestone: 0.1
@epogrebnyak (Collaborator) commented Jul 5, 2019

Make a larger file for testing with csvkit (for example, 10 + 10 + 50 rows).
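
A minimal sketch of how such a fixture could be generated with pandas (the column names, numeric columns, and output path below are illustrative assumptions, not the repo's actual schema; the 10 + 10 + 50 split follows the comment above):

# Hypothetical fixture builder -- column names and path are placeholders.
import numpy as np
import pandas as pd

def make_large_fixture(path="tests/raw_large.csv", seed=0):
    rng = np.random.default_rng(seed)
    parts = []
    # 10 rows in millions of rubles ("385"), 10 in rubles ("383"),
    # 50 already in thousands of rubles ("384")
    for unit, n in [("385", 10), ("383", 10), ("384", 50)]:
        parts.append(pd.DataFrame({
            "inn": [f"77{unit}{i:05d}" for i in range(n)],  # placeholder INN-like ids
            "unit": unit,
            "sales": rng.integers(1, 10_000, size=n),
            "profit": rng.integers(1, 1_000, size=n),
        }))
    df = pd.concat(parts, ignore_index=True)
    df.to_csv(path, index=False)
    return df

With a fixture like this, adjust_rub() can be checked by asserting that every row comes out with unit "384".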

@epogrebnyak epogrebnyak mentioned this issue Aug 6, 2019
@epogrebnyak epogrebnyak changed the title from "test suite" to "larger raw file for testing adjust_rub() function" Aug 19, 2019
@epogrebnyak epogrebnyak added this to the 0.1 milestone Aug 19, 2019
@epogrebnyak epogrebnyak added the enhancement label Aug 19, 2019
@epogrebnyak (Collaborator, Author) commented:

Code to be tested and speed-checked:

# FIXME: very slow code, even on small data
# maybe concatenating is faster?
#
# millions of rubles (unit "385") -> thousands ("384")
# bf = df[df.unit == "385"]
# bf.loc[:, cols] = bf.loc[:, cols].multiply(1000)
# bf.loc[:, "unit"] = "384"
# index = bf.index.tolist()
#
# rubles (unit "383") -> thousands ("384")
# tf = df[df.unit == "383"]
# tf.loc[:, cols] = tf.loc[:, cols].divide(1000).round(0).astype(int)
# tf.loc[:, "unit"] = "384"
# index.extend(tf.index.tolist())
#
# concat
# remains = df[~df.index.isin(index)]
# concat remains, bf, tf
def adjust_rub(df, cols=NUMERIC_COLUMNS):
    """Bring all rows to thousands of rubles (unit code "384")."""
    rows = (df.unit == "385")  # values in millions of rubles
    df.loc[rows, cols] = df.loc[rows, cols].multiply(1000)
    df.loc[rows, "unit"] = "384"
    rows = (df.unit == "383")  # values in rubles
    df.loc[rows, cols] = df.loc[rows, cols].divide(1000).round(0).astype(int)
    df.loc[rows, "unit"] = "384"
    return df
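
The FIXME above asks whether a split-and-concat version would be faster. A rough sketch of that variant, assuming the same NUMERIC_COLUMNS and unit codes (not benchmarked here, so treat it as a candidate to time, not a drop-in replacement):

import pandas as pd

def adjust_rub_concat(df, cols=NUMERIC_COLUMNS):
    """Same conversion to thousands of rubles ("384"), via splitting and re-concatenating."""
    bf = df[df.unit == "385"].copy()          # millions of rubles
    bf[cols] = bf[cols].multiply(1000)
    bf["unit"] = "384"

    tf = df[df.unit == "383"].copy()          # rubles
    tf[cols] = tf[cols].divide(1000).round(0).astype(int)
    tf["unit"] = "384"

    rest = df[~df.unit.isin(["383", "385"])]  # already in thousands
    return pd.concat([rest, bf, tf]).sort_index()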

epogrebnyak added 4 commits that referenced this issue Aug 21, 2019
@epogrebnyak (Collaborator, Author) commented Aug 21, 2019

Code for showing run times:

def canonic_df(df):
    """Transform the data inside the dataframe:

    - Bring all rows to the same unit of measurement (thousands of rubles)
    - Drop unused columns (date_revised, report_type)
    - New columns:
        * short company name
        * three levels of the OKVED code
        * region (derived from INN)

    """
    df_ = add_okved_subcode(add_region(add_title(df)))
    df_ = rename_rows(df_)
    df_ = adjust_rub(df_)
    return df_[canonic_columns()].set_index('inn')

print("obtaining source...")
root_df0 = boo.main.read_intermediate_df(2017)

print("canonic_df(df)")
df = root_df0.copy()
%timeit canonic_df(df)  

print("columns")
df = root_df0.copy()
%timeit add_okved_subcode(add_region(add_title(df)))

print("adjust rub")
df = root_df0.copy()
%timeit adjust_rub(df)

print("renaming")
df = root_df0.copy()
%timeit rename_rows(df)
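
Note that %timeit is an IPython/Jupyter magic. Outside a notebook, a rough equivalent using the standard-library timeit module could look like this (copying inside the lambda is deliberate, since adjust_rub mutates its argument and the conversion would otherwise be re-applied on every run):

import timeit

df = root_df0.copy()
seconds = timeit.timeit(lambda: adjust_rub(df.copy()), number=3) / 3
print(f"adjust_rub: {seconds:.2f} s per call")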

@epogrebnyak (Collaborator, Author) commented:

canonic_df(df)
1 loop, best of 3: 18.3 s per loop
columns
1 loop, best of 3: 14.5 s per loop
adjust rub
1 loop, best of 3: 2.08 s per loop
renaming
The slowest run took 4.85 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 181 ms per loop

@epogrebnyak (Collaborator, Author) commented:

Remaining questions branched to #13.
